
Alerting & Observability Standards Lead

Job in Fairfield, Solano County, California, 94533, USA
Listing for: Jobs via Dice
Full Time position
Listed on 2026-05-27
Job specializations:
  • IT/Tech
    IT Support, Cybersecurity
Salary/Wage Range or Industry Benchmark: 60,000 - 80,000 USD yearly
Job Description & How to Apply Below

Alert Management & Observability Standards Lead

Location: Fairfield, CA

Role Summary

The Alert Management & Observability Standards Lead is responsible for rationalizing and governing all system alerts to ensure they align with department priorities, operational coverage models, and service reliability goals. The role defines alerting standards, reviews and approves alerts before they are routed to the 24x7 Eyes‑on‑Glass Operations team, and establishes a scalable approach to cataloging alert response instructions so responders can take consistent, high‑quality actions.

This position operates at the intersection of the IT Operations Command Center (OCC), engineering/application teams, platform/monitoring tool owners, and service owners, ensuring alerts are actionable, prioritized, and paired with clear response guidance.

Key Responsibilities
  • Alert Rationalization & Prioritization (Core)
    • Establish and maintain a department‑wide alert rationalization framework that evaluates alerts for business/service criticality, operational priority, actionability, signal‑to‑noise ratio, ownership, and escalation paths.
    • Perform regular alert reviews to ensure alert quality, correct routing, and alignment with operational coverage.
    • Lead continuous improvement efforts to reduce alert fatigue while preserving detection of true incidents and high‑impact degradation.
  • Standards, Policies, and Guardrails
    • Define and enforce alerting standards including severity definitions, required metadata, naming conventions, routing rules, and an alert design checklist.
    • Partner with tool/platform owners to embed standards in monitoring tooling through templates, required fields, and automated validation.
  • Routing Decisions to 24x7 Eyes‑on‑Glass
    • Act as gatekeeper, deciding for each alert whether it goes to 24x7 Eyes‑on‑Glass, routes directly to on‑call engineering, creates a ticket for business‑hours handling, or is suppressed, aggregated, or converted to a dashboard.
    • Ensure routing aligns with operational responsibilities, department priorities, and service ownership models.
  • Runbook / Response Instruction Cataloging
    • Establish a consistent approach to cataloging response instructions for every actionable alert, covering symptoms, triage steps, remediation, escalation triggers, and links to dashboards, logs, SOPs, and known issues.
    • Own the runbook template, ensuring runbooks are versioned, maintained, and reviewed on a defined cadence.
  • Reporting & Operational Outcomes
    • Define and publish KPIs such as alert volume trends, percentage of alerts with runbooks, actionability rate, noise reduction, and mean time to acknowledge/triage.
    • Facilitate governance forums with service owners and engineering leads to review alert quality and backlog.
  • Cross‑Functional Enablement
    • Coach service teams on best practices including SLIs/SLOs, alert thresholds, dependency monitoring, incident correlation, and observability patterns.
    • Support major incident learning by feeding post‑incident insights back into the alerting system.
Ideal Profile
  • Empathy toward 24x7 NOC and emergency response environments.
  • Ability to translate technical alert data into business impact language.
  • Comfortable working with pushback – executive backing will be provided.
  • Work arrangement:
    Hybrid – 1–2 days on‑site per week; local candidates preferred in Fairfield, CA (Sacramento area also acceptable).
  • Reporting:
    Reports directly to Joe; works cross‑functionally across all teams.
  • Schedule:
    No 24x7 shift requirements for this role.
  • Equipment:
    Supplier provides laptop; candidate logs in via VDI; a PG&E laptop may be requested.
  • Key tools in use:
    Comarch OSS, Spectrum OI, NetBrain, NetMRI, Dynatrace, SCOM. Splunk was recently removed.
  • Work split: ~85–90% hands‑on technical, 10–15% governance.
Note
  • Goal:
    Rationalize and reduce alert noise for the 24x7 NOC; establish monitoring standards and thresholds across compute, network, and application layers.