Sr. Site Reliability Engineer - Incident Response

Job in Phoenix, Maricopa County, Arizona, 85003, USA
Listing for: Cox Communications
Full Time position
Listed on 2025-12-23
Job specializations:
  • IT/Tech
    Systems Engineer, IT Support, SRE/Site Reliability, Cloud Computing
Salary/Wage Range or Industry Benchmark: 99,000 USD per year
Job Description & How to Apply Below

Company

Cox Automotive - USA

Job Family Group

Engineering / Product Development

Job Profile

Sr Software Engineer

Management Level

Individual Contributor

Flexible Work Option

Hybrid - Ability to work remotely part of the week

Travel %

No

Work Shift

Day

Compensation

Compensation includes a base salary of $99,000.00 - $. The base salary may vary within the anticipated base pay range based on factors such as the ultimate location of the position and the selected candidate’s knowledge, skills, and abilities. Position may be eligible for additional compensation that may include an incentive program.

Job Description

The Site Reliability Engineer - Incident Response is a critical enterprise-level role responsible for accelerating incident resolution and enhancing the overall incident management process. This individual partners with engineering teams during active incidents to troubleshoot issues using monitoring and logging tools, and post‑incident, delivers executive‑level summaries that clearly communicate impact, root cause, and resolution. The SRE - Incident Response also plays a key role in analyzing incident response effectiveness and identifying opportunities for systemic improvements.

Core Competencies and Qualifications
  • Bachelor’s degree in a related discipline and 4 years’ experience in a related field. The right candidate could also have a different combination, such as a master’s degree and 2 years’ experience; a Ph.D. and up to 1 year of experience; or 16 years’ experience in a related field.
  • Applicants must currently be authorized to work in the United States for any employer without current or future sponsorship. No OPT, CPT, STEM/OPT, or visa sponsorship now or in the future.
  • Engineering/Tooling:
    Demonstrates the ability to design, build, and maintain engineering solutions and tools that enhance reliability, automate incident response, and reduce operational toil.
  • Incident Troubleshooting:
    Skilled in interpreting logs, metrics, and traces to assist in identifying root causes during live incidents.
  • Monitoring & Observability:
    Proficient in tools such as Datadog, Splunk, New Relic, or similar platforms.
  • Strong programming background in Python, Java, or C#, with experience building, maintaining, and troubleshooting production‑grade services and automation tools.
  • Proven ability to design and implement reliable, scalable, and highly available systems, leveraging software engineering best practices to improve system resilience and operational efficiency.
  • Experience developing automation and tooling to reduce toil, improve incident response, and support continuous improvement across monitoring, deployment, and recovery processes.
  • Ability to collaborate closely with software engineering teams to influence architecture and operational readiness, ensuring reliability is built into the system from design through production.
  • AI‑Centric Engineering:
    Effectively leverages artificial intelligence (AI) and machine learning (ML) tools to automate, optimize, and enhance daily engineering and incident response tasks.
  • Analytical Rigor:
    Strong attention to detail in validating incident data and identifying trends or gaps in response.
  • DevOps & Architecture Knowledge:
    Understands full‑stack systems, CI/CD pipelines, caching, scaling, and cloud‑native infrastructure.
  • Metrics & Reporting:
    Capable of calculating and interpreting key metrics like MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve).
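
As an illustration of the metrics named above, MTTA and MTTR can be computed as simple averages over incident timestamps. This is a minimal sketch, not the employer's method: the `incidents` records and the helper names `mean_minutes`, `mtta`, and `mttr` are hypothetical, and it assumes both metrics are measured from the moment an incident is opened (teams define the start point differently).

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (opened, acknowledged, resolved) timestamps.
incidents = [
    (datetime(2025, 1, 1, 10, 0), datetime(2025, 1, 1, 10, 5), datetime(2025, 1, 1, 11, 0)),
    (datetime(2025, 1, 2, 14, 0), datetime(2025, 1, 2, 14, 15), datetime(2025, 1, 2, 14, 30)),
]


def mean_minutes(pairs):
    """Average the elapsed time between (start, end) pairs, in minutes."""
    return mean((end - start).total_seconds() / 60 for start, end in pairs)


def mtta(records):
    """Mean Time to Acknowledge: opened -> acknowledged (assumed definition)."""
    return mean_minutes((opened, acked) for opened, acked, _ in records)


def mttr(records):
    """Mean Time to Resolve: opened -> resolved (assumed definition)."""
    return mean_minutes((opened, resolved) for opened, _, resolved in records)


print(f"MTTA: {mtta(incidents):.1f} min, MTTR: {mttr(incidents):.1f} min")
```

In practice the timestamps would come from an incident-management or observability platform rather than a hard-coded list, and a median or percentile is often reported alongside the mean so that one long incident does not dominate the figure.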
Responsibilities

Here are the responsibilities of this role when not tied to active on‑call:

Post‑Incident Review Development
  • Draft and deliver executive summaries post‑incident.
  • Develop and coach teams on blameless post‑mortems.
  • Create templates, train facilitators, and help guide root cause analysis (e.g., 5 Whys, fishbone diagrams).
  • Maintain a central library of learnings and cross‑cutting themes.
Incident Process Improvement
  • Actively support engineering teams during incidents by helping diagnose and resolve issues quickly.
  • Navigate and analyze data from observability platforms to make informed inferences about root causes.
  • Analyze the effectiveness of incident response to identify systemic reliability gaps.
  • Standardize incident response…