Senior Site Reliability Engineer
Job in Chicago, Cook County, Illinois, 60290, USA
Listed on 2026-01-05
Listing for: TAG - The Aspen Group
Full Time position
Job specializations:
- IT/Tech: Cloud Computing, SRE/Site Reliability
Job Description & How to Apply Below
Senior Site Reliability Engineer (SRE) – TAG – The Aspen Group
TAG is one of the largest and most trusted retail healthcare business support organizations. As a Senior Site Reliability Engineer at TAG, you will be responsible for ensuring the reliability, performance, and scalability of our core systems.
Responsibilities
- Design and build highly scalable and resilient systems to support our applications and services, incorporating predictive analytics to anticipate reliability risks.
- Develop and manage Service Level Objectives (SLOs) and Service Level Indicators (SLIs) using machine learning anomaly detection to ensure systems meet reliability targets.
- Drive improvements in system reliability, availability, and performance through proactive measures, automation, and intelligent failure prediction.
- Implement and manage comprehensive monitoring and alerting solutions, integrating with intelligent observability platforms that reduce alert noise and correlate events.
- Develop and maintain dashboards and reporting tools that provide data‑driven insights for actionable troubleshooting recommendations and performance optimization.
- Evaluate and integrate advanced monitoring tools and operational intelligence platforms to enhance observability and root cause identification.
- Lead and participate in incident response efforts, using intelligent log analysis and automated event correlation to speed up troubleshooting and root cause identification.
- Develop and maintain incident management processes incorporating automated decision support systems to improve response times and minimize service disruptions.
- Conduct post‑incident reviews, using automated pattern recognition and trend analysis to identify systemic issues and implement preventive measures.
- Analyze performance metrics and logs, supported by advanced observability tools, to detect bottlenecks and inefficiencies.
- Collaborate with development teams to implement automated profiling and optimization recommendations for code and infrastructure improvements.
- Perform capacity planning using machine learning forecasting models to ensure systems can handle current and future loads.
- Develop and implement automation solutions, including intelligent runbook automation, self‑healing systems, and automated incident triage.
- Identify and drive process improvements by applying machine learning to operational data for continuous optimization.
- Maintain documentation that includes automation and machine learning guidelines for monitoring, incident management, and SRE best practices.
- Work closely with engineering, operations, and product teams to align reliability and monitoring goals, including automation adoption strategies.
- Communicate effectively with stakeholders, providing regular updates on system health, incidents, performance improvements, and data‑driven insights.
- Foster a culture of collaboration, knowledge sharing, and automation best practices within the team and across the organization.
Qualifications
- Bachelor's degree in computer science or a related technical field.
- At least 5 years of experience in Site Reliability Engineering or a similar role.
- Strong proficiency in at least one programming language such as Python, Go, or C#.
- Demonstrated experience applying machine learning and automation to operational workflows such as monitoring, alerting, and incident response.
- Expertise with infrastructure as code tools such as Terraform.
- Proven experience working with and monitoring container environments such as Cloud Run and Kubernetes.
- Hands‑on experience working within an Azure, AWS, or GCP environment (GCP preferred).
- Strong understanding of networking, distributed systems, and cloud infrastructure.
- Familiarity with intelligent monitoring platforms and operational analytics tools such as Prometheus, Grafana, OpenSearch, Sentry, and Google Cloud Observability.
- Excellent problem‑solving skills and the ability to work independently and as part of a team.
- Experience with incident management, root cause analysis, and automated operational workflows.
Annual pay range: $129,000 – $160,000.
Benefits
A generous benefits package that includes paid time off; health, dental, and vision coverage; and a 401(k) savings plan with match.
Seniority level: Mid‑Senior level
Employment type: Full‑time
Job function: Health Care Provider
Industries: Hospitals and Health Care
Position Requirements
10+ years of work experience