Alert Management & Observability Standards Lead

Job in Vacaville, Solano County, California, 95688, USA
Listing for: TheCorporate
Full Time position
Listed on 2026-06-02
Job specializations:
  • IT/Tech
    IT Support, Cybersecurity
Salary/Wage Range or Industry Benchmark: 60000 - 80000 USD Yearly
Job Description & How to Apply Below
Position: Alert Management & Observability Standards Lead

Job Title

Alert Management & Observability Standards Lead

Location

Fairfield, CA

Role Summary

The Alert Management & Observability Standards Lead is responsible for rationalizing and governing all system alerts to ensure they align with department priorities, operational coverage models, and service reliability goals. This role defines alerting standards, reviews and approves alerts before they are routed to the 24x7 Eyes‑on‑Glass Operations team, and establishes a scalable approach to cataloging alert response instructions (runbooks/playbooks) so responders can take consistent, high‑quality actions.

This position operates at the intersection of the IT Operations Command Center (OCC), engineering/application teams, platform/monitoring tool owners, and service owners, ensuring alerts are actionable, prioritized, and paired with clear response guidance.

Key Responsibilities
  • Alert Rationalization & Prioritization:
    Establish and maintain a department‑wide alert rationalization framework that evaluates alerts for business/service criticality, operational priority, actionability, signal‑to‑noise, ownership, and escalation paths. Perform regular alert reviews to ensure alert quality and correct routing. Lead continuous improvement efforts to reduce alert fatigue while preserving detection of true incidents.
  • Standards, Policies, and Guardrails:
    Define and enforce alerting standards, including severity definitions, required metadata, naming conventions, and routing rules. Create a standardized Alert Design Checklist and approval workflow. Partner with tool/platform owners to embed standards in monitoring tooling. Act as gatekeeper for routing decisions to 24x7 Eyes‑on‑Glass, on‑call engineering, tickets, or suppression/aggregated dashboards.
  • Routing Alignment:
    Ensure routing aligns with operational responsibilities, department priorities, service ownership, and response instruction cataloging.
  • Response Instruction Cataloging:
    Establish a consistent approach to cataloging response instructions for each actionable alert, maintain runbook templates, and ensure versioning and a review cadence. Partner with service owners to keep runbooks current.
  • Reporting & Operational Outcomes:
    Define and publish KPIs (alert volume trends, percentage with runbooks, actionability rate, mean time to acknowledge/triage). Facilitate governance forums, coach service teams on best practices, drive adoption of observability patterns, and support major incident learning.
Required Qualifications
  • 5+ years in IT Operations, SRE, Observability, Monitoring Engineering, or Incident Management with demonstrated success reducing noise and improving actionability.
  • Experience with common monitoring/observability tools such as Splunk, AppDynamics, Dynatrace, Datadog, Prometheus/Grafana, Azure Monitor, CloudWatch, ServiceNow Event Management, or similar.
  • Strong understanding of incident response workflows, operational coverage models, CMDB/service ownership, dependency mapping, runbooks, and knowledge management, plus stakeholder management skills and the ability to drive standards across teams.
Preferred Qualifications
  • Experience designing or operating an Operations Command Center/NOC/SOC‑style “eyes‑on‑glass” model.
  • Familiarity with ITIL Event Management, SRE principles, and service reliability practices.
  • Automation experience for alert enrichment, correlation, and routing.
  • Background in governance frameworks and operating rhythm design.
Competencies / What Great Looks Like
  • Opinionated, data‑driven governance anchored in outcomes.
  • Practical standardization with usable templates and policies.
  • Operational empathy and knowledge of 24x7 responder needs.
  • Strict quality bar: only actionable alerts reach Eyes‑on‑Glass.
  • Continuous improvement mindset, with deliverables in the first 45 days: alerting standards published, intake workflow established, top 20 noisy services rationalized, runbook template launched, and central alert catalog created.