Site Reliability Engineer

Job in Deerfield, Lake County, Illinois, 60063, USA
Listing for: SRM Digital LLC
Full Time position
Listed on 2026-06-06
Job specializations:
  • IT/Tech
    SRE/Site Reliability, Systems Engineer, Cloud Computing
Salary/Wage Range or Industry Benchmark: 100,000 - 125,000 USD yearly
Job Description & How to Apply Below

Qualifications

  • 7+ years of experience in Site Reliability Engineering (SRE), Platform Engineering, Cloud Infrastructure Engineering, or related roles within large-scale enterprise environments.
  • 4+ years of hands‑on experience working primarily within Microsoft Azure cloud environments.
  • Strong expertise in Azure Kubernetes Service (AKS), including cluster lifecycle management, RBAC, network security policies, pod security standards, autoscaling, workload identity, and platform governance.
  • Proven experience building and supporting microservices‑based applications using Java and implementing CI/CD pipelines using Azure DevOps (ADO).
  • Hands‑on experience designing, implementing, and operating enterprise‑scale observability solutions using Dynatrace.
  • Strong understanding and practical experience establishing Service Level Objectives (SLOs), Service Level Indicators (SLIs), Error Budgets, and reliability‑focused operational practices.
  • Strong scripting and automation experience using Python, PowerShell, Azure Automation, and cloud‑native tooling.
Reliability Engineering & Platform Ownership
  • Define, establish, and continuously improve enterprise‑wide reliability standards, including SLOs, SLIs, and Error Budgets across business‑critical Azure‑hosted services.
  • Own service reliability metrics and regularly communicate SLA compliance, operational health, and reliability improvements to business and executive stakeholders.
  • Partner with architecture, development, and platform teams to ensure reliability, scalability, and resiliency requirements are embedded throughout the service lifecycle.
  • Conduct architecture and design reviews to ensure availability targets, resilience requirements, and recovery objectives are incorporated from initial design through production deployment.
  • Drive adoption of reliability engineering best practices and champion proactive resilience initiatives including chaos engineering methodologies.
Incident Management & Operational Excellence
  • Lead major incident management activities by serving as Incident Commander for high‑priority production incidents (P1/P2) and driving resolution efforts across cross‑functional teams.
  • Own the end‑to‑end incident lifecycle including detection, escalation, communication, resolution management, and post‑incident reviews.
  • Participate in structured global on‑call rotations and maintain operational response objectives for mission‑critical services.
  • Foster a blameless post‑mortem culture focused on continuous improvement and ensure corrective actions are tracked through completion.
Disaster Recovery & Resiliency
  • Design, implement, and maintain Disaster Recovery (DR) strategies across Azure environments to ensure business continuity and operational resilience.
  • Lead regular disaster recovery exercises, validate recovery processes, and continuously improve recovery readiness across critical workloads.
  • Establish and maintain Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) aligned with business requirements.
Observability & Monitoring
  • Design, build, and operate enterprise observability capabilities using Dynatrace to provide comprehensive visibility across Metrics, Events, Logs, and Traces (MELT).
  • Develop monitoring standards, dashboards, alerting frameworks, and operational reporting to improve service visibility and reduce incident response times.
  • Integrate monitoring and alerting platforms with enterprise tools including PagerDuty and ServiceNow to enable proactive operations.
  • Build automation frameworks, operational tooling, self‑healing capabilities, and reusable platform services to improve operational efficiency and reduce manual effort.
  • Develop and maintain infrastructure automation, operational runbooks, and platform engineering capabilities using Azure‑native services and scripting technologies.
  • Continuously identify opportunities to improve reliability, scalability, security, and operational efficiency through automation and platform enhancements.