Site Reliability Engineer (SRE) - Azure | DevSecOps | IaC | Governance | Observability

Job in Rexburg, Madison County, Idaho, 83440, USA
Listing for: Avaya Corporation
Full Time position
Listed on 2026-06-09
Job specializations:
  • IT/Tech
    IT Support, Cloud Computing, SRE/Site Reliability, Systems Engineer
Salary/Wage Range or Industry Benchmark: 129000 - 143000 USD Yearly
Job Description & How to Apply Below
Position: Site Reliability Engineer (SRE) - Azure | DevSecOps | IaC | Governance | Observability

Date:
Mar 17, 2026

Location:

Remote, US

Requisition

About Avaya

Avaya is an enterprise software leader that helps the world’s largest organizations and government agencies forge unbreakable connections.

The Avaya Infinity™ platform unifies fragmented customer experiences, connecting the channels, insights, technologies, and workflows that together create enduring customer and employee relationships.

We believe success is built through strong connections – with each other, with our work, and with our mission. At Avaya, you'll find a community that values your contributions and supports your growth every step of the way.

Learn more at

Description

We are seeking a Site Reliability Engineer (SRE) who will drive stability, reliability, and performance across our Azure and GCP-based platforms. This role blends operational excellence, proactive incident management, and strong collaboration with DevOps, Cloud, and Security teams.

The ideal candidate will have hands‑on experience with multi‑cloud environments (Azure and GCP), IaC (Terraform/Ansible), CI/CD (Jenkins/GitHub Actions), and modern observability and AI‑Ops systems. The engineer will also contribute to governance, cost optimization, and automation strategies that reduce toil and prevent issues before they occur. A key aspect of this role is the ability to perform deep‑dive troubleshooting of application performance and errors by analyzing logs and traces in platforms like Grafana and Datadog.

This position includes 24×7 support coverage (rotational) and requires strong ownership in managing major incidents, RCA processes, and continuous service improvements.

Key Responsibilities

Reliability & Incident Management

  • Serve as a key member of the 24×7 on‑call rotation, responding to and managing incidents across production and pre‑production environments.
  • Lead incident bridges, coordinate root cause analysis (RCA), and ensure post‑incident reviews drive systemic improvements.
  • Maintain clear communication with cross‑functional teams and leadership during major incidents.

Monitoring, AI‑Ops, Alerts & Prevention

  • Build, tune, and maintain observability dashboards (Azure Monitor, GCP Operations Suite, Prometheus, Grafana, Datadog, Log Analytics).
  • Perform deep‑dive troubleshooting of application and service‑level issues using distributed tracing and log analysis (Grafana, Datadog) to pinpoint root causes beyond infrastructure.
  • Define SLOs, SLIs, and error budgets to proactively identify and mitigate reliability risks before customer impact.
  • Integrate AI‑Ops tools for anomaly detection, predictive alerting, and automated incident correlation.
  • Continuously enhance alert quality, reduce false positives, and automate runbooks for faster recovery.
  • Analyze trends to prevent recurring issues and support teams in resilience engineering.
Requirements

Required Skills & Experience

  • 5+ years in Site Reliability, DevOps, Cloud Operations, or customer support roles.
  • Demonstrated experience in application‑level troubleshooting by analyzing logs and traces to identify bugs, performance bottlenecks, and error conditions.
  • Expertise in Azure and GCP cloud operations and distributed system reliability.
  • Understanding of Terraform, Ansible, and CI/CD pipelines (Jenkins, GitHub Actions).
  • Experience with observability and AI‑Ops tools (Azure Monitor, GCP Operations Suite, Grafana, Prometheus, Datadog, etc.).
  • Solid grasp of incident management frameworks (P1–P3 handling, RCA, PIRs, on‑call rotations).
  • Excellent analytical, troubleshooting, and communication skills.

Desired Behaviours

  • Proactive Prevention: Identifies and resolves risks before they escalate into incidents.
  • AI‑Driven Mindset: Applies AI and automation to improve reliability and reduce human intervention.
  • Accountability: Owns service reliability and communicates with clarity.
  • Collaboration: Works seamlessly with platform, DevOps, and product teams.
  • Efficiency: Focuses on automation to reduce manual effort and improve MTTR.
  • Continuous Improvement: Learns from failures, iterates processes, and enhances documentation.

The pay range for this opportunity is from $129,000 to $143,000 + performance‑related bonus + benefits. This range represents the anticipated low and high end…
