Major Incident Management; MIM & NOC Lead
Job in
Wilmington, New Castle County, Delaware, 19803, USA
Listed on 2026-06-02
Listing for:
Yochana
Full Time position
Job specializations:
IT/Tech
IT Support, Cybersecurity, Systems Engineer, Cloud Computing
Job Description & How to Apply Below
Job Title: Major Incident Management (MIM) & NOC Lead (10+ Years)
Location: Wilmington, DE (Day 1 Onsite)
Job Type: Full-time position
Interview process: Team Interview
Experience: 10+ years in IT Operations / NOC / Major Incident Management, including team leadership responsibility.
Role Summary:
The Major Incident Management & NOC Lead is responsible for end-to-end command and control of the enterprise's 24x7 operational monitoring and incident response. This role leads the MIM and NOC function, drives Major Incident (P1/P2) execution, ensures rapid service restoration, and continuously improves operational maturity through problem management, automation, observability enhancements, and SLA governance.
This role requires a mix of strong incident leadership, technical depth across infrastructure and applications, and people/process management to ensure stability, availability, and performance across critical services.
Key Responsibilities:
A) Major Incident Management (Command & Control)
Own the Major Incident (P1/P2) process from detection to resolution, including war-room leadership, stakeholder updates, and closure.
Act as the Incident Commander and ensure structured triage, containment, workaround, and restoration.
Drive cross-functional coordination (App, Infra, Network, Security, DB, Cloud, Vendor teams) to reduce MTTR.
Ensure high-quality incident communications: executive summaries, impact analysis, ETAs, customer/business comms.
Lead and facilitate Post Incident Reviews (PIR/RCA); ensure actionable corrective/preventive actions (CAPA).
Identify recurring issues and trigger Problem Management with measurable reduction plans.
B) NOC Leadership & Operations
Lead the NOC team responsible for 24x7 monitoring, alert triage, event correlation, escalation, and ticket quality.
Establish/maintain standard operating procedures (SOPs), runbooks, escalation matrices, and on-call models.
Ensure NOC meets SLAs/OLAs, improves alert fidelity, and reduces noise through tuning and automation.
Manage handover governance between shifts; maintain service continuity and operational hygiene.
C) Service Reliability & Continuous Improvement
Drive operational improvements: monitoring coverage, SLO/SLA alignment, incident prevention, and resiliency initiatives.
Partner with Engineering/Platform teams on observability strategy, proactive detection, and reliability patterns.
Track and report operational metrics: MTTD, MTTR, incident volume, re-open rate, SLA compliance, and trends.
Support readiness for audits and compliance: evidence collection, process adherence, and risk mitigation.
D) Stakeholder & Vendor Management
Interface with business stakeholders, service owners, and leadership to provide incident status, risk, and remediation plans.
Manage vendor escalations and ensure timely resolution aligned to contractual SLAs.
E) Managerial / Leadership Skills (Must Have)
Proven experience leading MIM & NOC Operations teams (shift-based or on-call models).
Strong Incident Commander capability: calm under pressure, structured decision-making, priority trade-offs.
Excellent stakeholder management across technical teams and business leadership.
Ability to build and enforce process discipline (ITIL-aligned), while improving speed and quality.
Strong coaching/mentoring: performance management, skill development, hiring support as needed.
Effective communication: concise executive updates, clear action plans, facilitation of PIR/RCA sessions.
Data-driven mindset: uses metrics and trend analysis to drive operational outcomes.
Technical Skills (Must Have):
A) Monitoring / Observability
Hands-on experience with NOC tooling and observability platforms such as:
Splunk / ELK, Datadog, Dynatrace, New Relic, AppDynamics
Prometheus/Grafana, CloudWatch/Azure Monitor
Strong understanding of event correlation, alert tuning, noise reduction, and dashboarding.
B) Incident / ITSM Platforms
Strong working knowledge of ServiceNow (Incident, Problem, Change, Knowledge, CMDB) or equivalent ITSM tools.
Experience designing workflows, SLAs/OLAs, routing rules, and automation integrations.
C) Infrastructure & Platform Breadth
Solid understanding across:
Windows/Linux administration basics
Network fundamentals (DNS, DHCP, TCP/IP, routing, load balancers, firewalls)
Compute/virtualization (VMware/Hyper-V) and storage concepts
Database fundamentals (SQL/Oracle, replication, performance symptoms)
Cloud fundamentals and operational support for AWS/Azure/GCP:
IAM basics, networking (VPC/VNet), scaling, logging/monitoring, common failure patterns.
D) Automation & Scripting (Good to Have / Preferred)
Scripting knowledge:
PowerShell / Python / Bash
Familiarity with automation tools:
Ansible, Terraform, CI/CD operational workflows.
Ability to create/maintain runbook automation and self-healing patterns.
E) Security & Resilience (Preferred)
Awareness of security operations touchpoints: DDoS symptoms, certificate expiries, IAM issues, endpoint/EDR alerts.
Familiarity with BCP/DR processes, failover testing, and resilience design…