Associate Principal, Site Reliability Engineer Job, Chicago area, Illinois USA, IT/Tech

Associate Principal, Site Reliability Engineer

Location:
Chicago, IL or Dallas, TX – Onsite 3 Days per Week

Type:
Direct Hire

Benefits:
Competitive Salary ($ - $170,000) plus 8‑15% target bonus. Comprehensive medical, dental, vision, PTO, paid holidays, 401(k) with match, professional development, collaborative culture, work‑life balance.

Summary

Chamberlain Advisors is partnering with a leading equity derivatives clearing organization to hire a highly skilled Senior Site Reliability Engineer (SRE) to support the reliability, availability, and performance of our next‑generation cloud platforms. This role is critical to ensuring our systems operate at scale with high resiliency while enabling development teams to deliver features quickly and safely. The Senior SRE will work closely with software engineering, platform, and infrastructure teams to design, build, and operate reliable distributed systems.

Emphasis is on automation, observability, operational excellence, and continuous improvement, blending software engineering with systems and cloud expertise.

Ideal Candidate

Strong analytical and problem‑solving skills with a systematic approach to troubleshooting
Ability to succeed in a fast‑paced environment with evolving priorities
Excellent written and verbal communication skills
Strong documentation skills with attention to detail
Self‑starter mindset with the ability to research, learn, and deliver independently
Collaborative, team‑oriented approach with a focus on shared success

Accountable For

Ensure the availability, performance, scalability, and reliability of production systems supporting Chamberlain’s cloud‑based platforms
Partner with software development, operations, and infrastructure teams to design and operate production‑ready services
Design and implement automation to improve incident response, reduce manual effort, and prevent recurring issues
Develop, maintain, and continuously improve runbooks and operational documentation for service outages and degradations
Assess production readiness of services by evaluating reliability, observability, scalability, and operational risk
Define, implement, and monitor key operational metrics related to system health, performance, and capacity
Architect, develop, and maintain shared reliability services and tooling used across the organization
Participate in incident management, root cause analysis, and post‑incident reviews with a focus on long‑term remediation
Contribute to continuous improvement through retrospectives, technical research, code reviews, and design discussions
Influence delivery timelines and technical expectations by identifying reliability risks and improvement opportunities
Mentor junior engineers and share knowledge through documentation and collaborative team engagement
Support Agile/Scrum delivery by contributing to sprint planning, backlog refinement, and story development

Qualifications

Bachelor’s degree in Management Information Systems, Computer Science, or a related field
4+ years of experience in Site Reliability Engineering, DevOps, or a related engineering discipline
Proven experience supporting large‑scale, distributed, production systems
Experience working in Agile/Scrum environments
Cloud Platforms:
Public cloud experience with AWS (preferred), Azure, or GCP
Observability & AIOps:
Monitoring, logging, alerting, and predictive analytics using tools such as Splunk, Datadog, AppDynamics, Prometheus, Grafana, Sysdig, or Stackdriver
Programming & Automation:
Proficiency in Python, Java, Go, or Bash for automation and tooling
Containers & Orchestration:
Experience with Kubernetes and container platforms such as Docker, Rancher, or Mesos
Distributed Systems:
Messaging and event‑driven platforms including Kafka, RabbitMQ, or ActiveMQ
CI/CD & DevOps:
Pipeline and deployment tools such as Jenkins, Harness, Travis CI, AWS CodeBuild/CodePipeline, or AppVeyor
AI Enablement:
Familiarity using Large Language Models (LLMs) to automate SRE workflows (e.g., scripting, incident analysis, reporting)
Resilience Engineering:
Foundational exposure to Chaos Engineering and fault‑injection tools (e.g., Gremlin, Chaos Monkey, AWS FIS)

About Our Client

O…

