Site Reliability Engineer
Job in Deerfield, Lake County, Illinois, 60063, USA
Listed on 2026-06-06
Listing for: SRM Digital LLC
Full Time position
Job specializations:
- IT/Tech: SRE/Site Reliability, Systems Engineer, Cloud Computing
Job Description & How to Apply Below
Qualifications
- 7+ years of experience in Site Reliability Engineering (SRE), Platform Engineering, Cloud Infrastructure Engineering, or related roles within large-scale enterprise environments.
- 4+ years of hands‑on experience working primarily within Microsoft Azure cloud environments.
- Strong expertise in Azure Kubernetes Service (AKS), including cluster lifecycle management, RBAC, network security policies, pod security standards, autoscaling, workload identity, and platform governance.
- Proven experience building and supporting microservices‑based applications using Java and implementing CI/CD pipelines using Azure DevOps (ADO).
- Hands‑on experience designing, implementing, and operating enterprise‑scale observability solutions using Dynatrace.
- Strong understanding and practical experience establishing Service Level Objectives (SLOs), Service Level Indicators (SLIs), Error Budgets, and reliability‑focused operational practices.
- Strong scripting and automation experience using Python, PowerShell, Azure Automation, and cloud‑native tooling.
Responsibilities
- Define, establish, and continuously improve enterprise‑wide reliability standards, including SLOs, SLIs, and Error Budgets across business‑critical Azure‑hosted services.
- Own service reliability metrics and regularly communicate SLA compliance, operational health, and reliability improvements to business and executive stakeholders.
- Partner with architecture, development, and platform teams to ensure reliability, scalability, and resiliency requirements are embedded throughout the service lifecycle.
- Conduct architecture and design reviews to ensure availability targets, resilience requirements, and recovery objectives are incorporated from initial design through production deployment.
- Drive adoption of reliability engineering best practices and champion proactive resilience initiatives including chaos engineering methodologies.
- Lead major incident management activities by serving as Incident Commander for high‑priority production incidents (P1/P2) and driving resolution efforts across cross‑functional teams.
- Own the end‑to‑end incident lifecycle including detection, escalation, communication, resolution management, and post‑incident reviews.
- Participate in structured global on‑call rotations and maintain operational response objectives for mission‑critical services.
- Foster a blameless post‑mortem culture focused on continuous improvement and ensure corrective actions are tracked through completion.
- Design, implement, and maintain Disaster Recovery (DR) strategies across Azure environments to ensure business continuity and operational resilience.
- Lead regular disaster recovery exercises, validate recovery processes, and continuously improve recovery readiness across critical workloads.
- Establish and maintain Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) aligned with business requirements.
- Design, build, and operate enterprise observability capabilities using Dynatrace to provide comprehensive visibility across Metrics, Events, Logs, and Traces (MELT).
- Develop monitoring standards, dashboards, alerting frameworks, and operational reporting to improve service visibility and reduce incident response times.
- Integrate monitoring and alerting platforms with enterprise tools including PagerDuty and ServiceNow to enable proactive operations.
- Build automation frameworks, operational tooling, self‑healing capabilities, and reusable platform services to improve operational efficiency and reduce manual effort.
- Develop and maintain infrastructure automation, operational runbooks, and platform engineering capabilities using Azure‑native services and scripting technologies.
- Continuously identify opportunities to improve reliability, scalability, security, and operational efficiency through automation and platform enhancements.