Alert Management & Observability Standards Lead
Listed on 2026-06-02
IT/Tech
IT Support, Cybersecurity
Job Title
Alert Management & Observability Standards Lead
Location
Fairfield, CA
Role Summary
The Alert Management & Observability Standards Lead is responsible for rationalizing and governing all system alerts to ensure they align with department priorities, operational coverage models, and service reliability goals. This role defines alerting standards, reviews and approves alerts before they are routed to the 24x7 Eyes‑on‑Glass Operations team, and establishes a scalable approach to cataloging alert response instructions (runbooks/playbooks) so responders can take consistent, high‑quality actions.
This position operates at the intersection of the IT Operations Command Center (OCC), engineering/application teams, platform/monitoring tool owners, and service owners, ensuring alerts are actionable, prioritized, and paired with clear response guidance.
- Alert Rationalization & Prioritization:
Establish and maintain a department‑wide alert rationalization framework that evaluates alerts for business/service criticality, operational priority, actionability, signal‑to‑noise, ownership, and escalation paths. Perform regular alert reviews to ensure alert quality and correct routing. Lead continuous improvement efforts to reduce alert fatigue while preserving detection of true incidents.
- Standards, Policies, and Guardrails:
Define and enforce alerting standards, including severity definitions, required metadata, naming conventions, and routing rules. Create a standardized Alert Design Checklist and approval workflow. Partner with tool/platform owners to embed standards in monitoring tooling. Act as gatekeeper for routing decisions to 24x7 Eyes‑on‑Glass, on‑call engineering, tickets, or suppression/aggregated dashboards.
- Routing Alignment:
Ensure routing aligns with operational responsibilities, department priorities, service ownership, and response instruction cataloging.
- Response Instruction Cataloging:
Establish a consistent approach to cataloging response instructions for each actionable alert, including maintaining runbook templates and ensuring versioning and a review cadence. Partner with service owners to keep runbooks current.
- Reporting & Operational Outcomes:
Define and publish KPIs (alert volume trends, percentage of alerts with runbooks, actionability rate, mean time to acknowledge/triage). Facilitate governance forums, coach service teams on best practices, drive adoption of observability patterns, and support major incident learning.
- 5+ years in IT Operations, SRE, Observability, Monitoring Engineering, or Incident Management with demonstrated success reducing noise and improving actionability.
- Experience with common monitoring/observability tools such as Splunk, AppDynamics, Dynatrace, Datadog, Prometheus/Grafana, Azure Monitor, CloudWatch, ServiceNow Event Management, or similar.
- Strong understanding of incident response workflows, operational coverage models, CMDB/service ownership, dependency mapping, runbooks, knowledge management, stakeholder management, and ability to drive standards across teams.
- Experience designing or operating an Operations Command Center/NOC/SOC‑style “eyes‑on‑glass” model.
- Familiarity with ITIL Event Management, SRE principles, and service reliability practices.
- Automation experience for alert enrichment, correlation, and routing.
- Background in governance frameworks and operating rhythm design.
- Opinionated, data‑driven governance anchored in outcomes.
- Practical standardization with usable templates and policies.
- Operational empathy and knowledge of 24x7 responder needs.
- Strict quality bar: only actionable alerts reach Eyes‑on‑Glass.
- Continuous improvement mindset.
- Deliverables in the first 45 days: alerting standards published, intake workflow established, top 20 noisy services rationalized, runbook template launched, and central alert catalog created.