
Azure SRE Manager: Lead Reliability & Automation

Job in Irving, Dallas County, Texas, 75084, USA
Listing for: Paradigm
Full Time position
Listed on 2026-05-28
Job specializations:
  • IT/Tech
    SRE/Site Reliability, Systems Engineer, Cloud Computing
Salary/Wage Range or Industry Benchmark: 100000 - 125000 USD Yearly
Job Description & How to Apply Below

Paradigm is a software company transforming the way that the residential, construction & building product industries operate across the globe. We are looking for a Manager, Site Reliability Engineering to be part of revolutionizing these industries.

We're looking for a hands‑on SRE leader to build and develop a high‑performing team that oversees reliability across our Azure‑based platform. You'll promote modern SRE practices, drive down incident response times, and shape a culture where automation replaces toil and every incident becomes a learning opportunity.

This role combines technical depth with people leadership. You'll design reliability frameworks, lead incident response, coach engineers, and partner with product teams to embed reliability into everything we build. Working closely with the Senior Director of SRE & Cloud Operations, you'll transform reactive operations into proactive, data‑driven service management with increasing use of AI and automation to get there faster.

What You Will Do:
  • Lead and grow a team of site reliability engineers. Provide guidance, mentorship, and career development.
  • Contribute to and mature SRE practices across production services: SLOs, SLIs, error budgets, toil reduction, and blameless post‑mortems that turn incidents into lasting improvements.
  • Oversee the incident management lifecycle end‑to‑end including detection, response, resolution, post‑incident review, and systemic improvement.
  • Design on‑call rotations, runbooks, and escalation procedures that balance service reliability with engineer well‑being and sustainable work practices.
  • Drive measurable reductions in MTTR and MTTD through improved observability, intelligent automation, and predictive monitoring.
  • Build automation to eliminate manual operational work including provisioning, deployment, scaling, self‑healing, and reporting.
  • Implement chaos engineering practices to validate system resilience and surface weaknesses before they cause outages.
  • Partner with engineering and product teams to embed reliability requirements into the development lifecycle, from design through deployment.
  • Collaborate with the observability team to ensure comprehensive instrumentation, smart alerting, and actionable dashboards across all critical services.
  • Measure, report, and advocate for reliability improvements with both technical and executive stakeholders using data to drive investment decisions.
What You Need to Succeed:
  • Bachelor’s degree in Engineering or a related field, or equivalent experience.
  • 7+ years in site reliability engineering, DevOps, or infrastructure engineering, with at least 1 year in people management (or demonstrated tech lead experience with direct influence over team processes and career growth).
  • Hands‑on experience running production systems on Azure (including proficiency with key services such as AKS, App Services, Service Bus, Event Grid, and Azure Monitor) or comparable cloud platforms.
  • Proven track record implementing SRE practices with measurable reliability improvements and familiarity with modern observability platforms (Datadog, Prometheus/Grafana, or equivalent). AI‑enhanced observability experience is preferred.
  • Experience leading incident response for high‑severity production issues and running effective post‑mortems.
  • Strong background in automation, infrastructure as code (Terraform, Bicep, or similar), and systematically eliminating manual operational work.
  • Production‑grade operational experience with Kubernetes container orchestration.
  • Ability to automate workflows and build scripts using Python, Bash, PowerShell, or Go.
  • Experience with AI coding assistants and CI/CD systems (GitHub Actions, Azure DevOps, ArgoCD) with automation capabilities is preferred.
  • Knowledge of distributed systems patterns is preferred.
  • Exposure to AIOps platforms or using LLMs for operational automation is preferred.
  • Strong communication skills, with the ability to make complex technical issues clear for both engineers and executives.
  • Data‑driven approach. You use metrics and telemetry to guide decisions, not gut feel.
  • You are collaborative cross‑functionally and build trust and alignment naturally.