Incident Management Lead Job, Deerfield area, Illinois, USA, IT/Tech

Must Have Technical/Functional Skills

6+ years of IT Service Management experience with a minimum of 3 years in a dedicated Major Incident Management or Incident Commander role in a large enterprise (Fortune 500 / FTSE 100 equivalent complexity).

ITIL 4 Managing Professional or ITIL 4 Specialist: High Velocity IT certification (ITIL 4 Foundation minimum required).

Demonstrable experience managing Azure platform incidents: working knowledge of Azure Monitor, Azure Service Health, Log Analytics, Application Insights, and Microsoft support escalation paths.

Proven ability to command high-pressure P1 incidents involving 20+ stakeholders across technical and executive levels simultaneously.

Expert-level proficiency in ServiceNow ITSM, including the Incident, Problem, and Change modules and dashboard/report building.

Strong data analysis skills: ability to analyze incident trends, build KPI dashboards, and present actionable insights to senior leadership.

Roles & Responsibilities

Major Incident Command & Coordination

Serve as the single accountable owner for all P1 and P2 major incidents across on-premises and Azure-hosted services, from initial declaration through resolution and post-incident closure.

Convene and chair live incident bridge calls and virtual war rooms using Microsoft Teams, coordinating across 10+ internal technical resolver groups, managed service partners, and Microsoft Azure Support (Unified Support escalations).

Drive swift triage by using Azure Service Health, Resource Health, and Azure Monitor dashboards to establish scope, affected services, and blast radius within the first 15 minutes of an incident.

Make and enforce escalation decisions, including engaging Microsoft CSS P1 Severity A support cases and activating DR runbooks where service restoration via normal means is not achievable within RTO.

Maintain clear, timely, and audience-appropriate stakeholder communications throughout the incident lifecycle, including CEO/CISO executive briefings for business-critical outages.

Post-Incident Review & Continual Improvement

Facilitate structured, blameless Post-Incident Reviews (PIRs) within agreed SLAs (P1: 48 hours; P2: 5 business days); produce high-quality PIR reports consumed by the CTO and Board Technology Committee.

Own the incident action item registry; chair weekly SIP (Service Improvement Plan) reviews to ensure commitments are delivered on time and to quality.

Identify systemic incident patterns through trend analysis using ServiceNow and Log Analytics; collaborate with Problem Management to drive root cause elimination for repeat incidents.

Define, track, and report on enterprise incident management KPIs: MTTD, MTTR, incident recurrence rate, SLA compliance, and customer impact hours, presented to IT leadership in monthly operational reviews.

Process Ownership & ITSM Governance

Own, maintain, and continuously improve the enterprise Major Incident Management process, policy, playbooks, and runbooks aligned to ITIL 4 and the organization's IT Risk and Control Framework.

Define and govern the incident severity classification matrix and escalation decision tree; ensure consistent adoption across all IT towers and managed service partners.

Maintain and test the enterprise crisis communication framework, including stakeholder notification trees, bridge protocols, and executive communication templates.

Collaborate with Change Management to ensure CAB processes adequately assess change-induced incident risk; maintain correlation tracking between changes and incidents.

Azure Operations & Cloud Incident Specifics

Develop and maintain Azure-specific incident playbooks covering platform scenarios: AKS node/pod failures, Azure SQL failover events, ExpressRoute circuit drops, Azure Active Directory (Entra ID) authentication outages, and Azure region-wide service incidents.

Maintain working relationships with the Microsoft TAM (Technical Account Manager) and Azure Rapid Response team; ensure escalation paths to Microsoft CSS are exercised and SLAs understood.

Monitor Azure Service Health and the Microsoft 365 Service Health Dashboard proactively; initiate pre-emptive incident declarations for advisory/degraded-service notifications affecting business-critical services.

Participate in Azure Operational Reviews with Cloud Platform and SRE teams to identify observability gaps, alerting blind spots, and runbook deficiencies before they manifest as major incidents.

Capability Building & Stakeholder Engagement

Design and deliver MIM process training programmes for Level 1/2 Service Desk, resolver groups,
and technology leadership; conduct quarterly simulation exercises (Game Day / Incident Ex).
Act as a subject matter expert in enterprise-wide DR and BCP exercises; validate incident response
readiness across all Azure-hosted Tier-0 services.
Build and manage a network of Incident Coordinators across global IT towers to support follow-the-sun
incident coverage.
Generic Managerial Skills, if any
Define and govern the incident severity classification matrix and…