Major Incident Management (MIM) & NOC Lead Job, Wilmington area, Delaware, USA, IT/Tech

Position: Major Incident Management (MIM) & NOC Lead
Job Title:
Major Incident Management (MIM) & NOC Lead (10+ Years)

Location:
Wilmington, DE (Day 1 Onsite)

Job Type: Full-time position

Interview process:
Team Interview

Experience: 10+ years in IT Operations / NOC / Major Incident Management, including team leadership and ownership.

Role Summary:

The Major Incident Management & NOC Lead is responsible for end-to-end command and control of the enterprise's 24x7 operational monitoring and incident response. This role leads the MIM and NOC functions, drives Major Incident (P1/P2) execution, ensures rapid service restoration, and continuously improves operational maturity through problem management, automation, observability enhancements, and SLA governance.

This role requires a mix of strong incident leadership, technical depth across infrastructure and applications, and people/process management to ensure stability, availability, and performance across critical services.

Key Responsibilities:

A) Major Incident Management (Command & Control)

Own the Major Incident (P1/P2) process from detection to resolution, including war-room leadership, stakeholder updates, and closure.

Act as the Incident Commander and ensure structured triage, containment, workaround, and restoration.

Drive cross-functional coordination (App, Infra, Network, Security, DB, Cloud, Vendor teams) to reduce MTTR.

Ensure high-quality incident communications: executive summaries, impact analysis, ETAs, customer/business comms.

Lead and facilitate Post Incident Reviews (PIR/RCA); ensure actionable corrective/preventive actions (CAPA).

Identify recurring issues and trigger Problem Management with measurable reduction plans.

B) NOC Leadership & Operations

Lead the NOC team responsible for 24x7 monitoring, alert triage, event correlation, escalation, and ticket quality.

Establish/maintain standard operating procedures (SOPs), runbooks, escalation matrices, and on-call models.

Ensure NOC meets SLAs/OLAs, improves alert fidelity, and reduces noise through tuning and automation.

Manage handover governance between shifts; maintain service continuity and operational hygiene.

C) Service Reliability & Continuous Improvement

Drive operational improvements: monitoring coverage, SLO/SLA alignment, incident prevention, and resiliency initiatives.

Partner with Engineering/Platform teams on observability strategy, proactive detection, and reliability patterns.

Track and report operational metrics: MTTD, MTTR, incident volume, re-open rate, SLA compliance, and trends.

Support readiness for audits and compliance: evidence collection, process adherence, and risk mitigation.

D) Stakeholder & Vendor Management

Interface with business stakeholders, service owners, and leadership to provide incident status, risk, and remediation plans.

Manage vendor escalations and ensure timely resolution aligned to contractual SLAs.

E) Managerial / Leadership Skills (Must Have)

Proven experience leading MIM & NOC Operations teams (shift-based or on-call models).

Strong Incident Commander capability: calm under pressure, structured decision-making, priority trade-offs.

Excellent stakeholder management across technical teams and business leadership.

Ability to build and enforce process discipline (ITIL-aligned), while improving speed and quality.

Strong coaching/mentoring: performance management, skill development, hiring support as needed.

Effective communication: concise executive updates, clear action plans, facilitation of PIR/RCA sessions.

Data-driven mindset: uses metrics and trend analysis to drive operational outcomes.

Technical Skills (Must Have):

A) Monitoring / Observability

Hands-on experience with NOC tooling and observability platforms such as:

Splunk / ELK, Datadog, Dynatrace, New Relic, AppDynamics

Prometheus/Grafana, CloudWatch/Azure Monitor

Strong understanding of event correlation, alert tuning, noise reduction, and dashboarding.

B) Incident / ITSM Platforms

Strong working knowledge of ServiceNow (Incident, Problem, Change, Knowledge, CMDB) or equivalent ITSM tools.

Experience designing workflows, SLAs/OLAs, routing rules, and automation integrations.

C) Infrastructure & Platform Breadth

Solid understanding across:

Windows/Linux administration basics

Network fundamentals (DNS, DHCP, TCP/IP, routing, load balancers, firewalls)

Compute/virtualization (VMware/Hyper-V) and storage concepts

Database fundamentals (SQL/Oracle, replication, performance symptoms)

Cloud fundamentals and operational support for AWS/Azure/GCP:

IAM basics, networking (VPC/VNet), scaling, logging/monitoring, common failure patterns.

D) Automation & Scripting (Good to Have / Preferred)

Scripting knowledge:
PowerShell / Python / Bash

Familiarity with automation tools:
Ansible, Terraform, CI/CD operational workflows.

Ability to create/maintain runbook automation and self-healing patterns.

E) Security & Resilience (Preferred)

Awareness of security operations touchpoints: DDoS symptoms, certificate expiries, IAM issues, endpoint/EDR alerts.

Familiarity with BCP/DR processes, failover testing, and resilience design…