Associate Principal, Site Reliability Engineer
Listed on 2025-12-27
IT/Tech
Cloud Computing, Systems Engineer, SRE/Site Reliability, IT Support
Location:
Chicago, IL or Dallas, TX – Onsite 3 Days per Week
Type:
Direct Hire
Benefits:
Competitive Salary ($ - $170,000) plus 8‑15% target bonus. Comprehensive medical, dental, vision, PTO, paid holidays, 401(k) with match, professional development, collaborative culture, work‑life balance.
Chamberlain Advisors is partnering with a leading equity derivatives clearing organization to hire a highly skilled Senior Site Reliability Engineer (SRE) to support the reliability, availability, and performance of our next‑generation cloud platforms. This role is critical to ensuring our systems operate at scale with high resiliency while enabling development teams to deliver features quickly and safely. The Senior SRE will work closely with software engineering, platform and infrastructure teams to design, build, and operate reliable distributed systems.
Emphasis is on automation, observability, operational excellence, and continuous improvement, blending software engineering with systems and cloud expertise.
- Strong analytical and problem‑solving skills with a systematic approach to troubleshooting
- Ability to succeed in a fast‑paced environment with evolving priorities
- Excellent written and verbal communication skills
- Strong documentation skills with attention to detail
- Self‑starter mindset with the ability to research, learn, and deliver independently
- Collaborative, team‑oriented approach with a focus on shared success
- Ensure the availability, performance, scalability, and reliability of production systems supporting Chamberlain’s cloud‑based platforms
- Partner with software development, operations, and infrastructure teams to design and operate production‑ready services
- Design and implement automation to improve incident response, reduce manual effort, and prevent recurring issues
- Develop, maintain, and continuously improve runbooks and operational documentation for service outages and degradations
- Assess production readiness of services by evaluating reliability, observability, scalability, and operational risk
- Define, implement, and monitor key operational metrics related to system health, performance, and capacity
- Architect, develop, and maintain shared reliability services and tooling used across the organization
- Participate in incident management, root cause analysis, and post‑incident reviews with a focus on long‑term remediation
- Contribute to continuous improvement through retrospectives, technical research, code reviews, and design discussions
- Influence delivery timelines and technical expectations by identifying reliability risks and improvement opportunities
- Mentor junior engineers and share knowledge through documentation and collaborative team engagement
- Support Agile/Scrum delivery by contributing to sprint planning, backlog refinement, and story development
- Bachelor’s degree in Management Information Systems, Computer Science, or a related field
- Minimum of 4 years of experience in Site Reliability Engineering, DevOps, or a related engineering discipline
- Proven experience supporting large‑scale, distributed, production systems
- Experience working in Agile/Scrum environments
- Cloud Platforms: Public cloud experience with AWS (preferred), Azure, or GCP
- Observability & AIOps: Monitoring, logging, alerting, and predictive analytics using tools such as Splunk, Datadog, AppDynamics, Prometheus, Grafana, Sysdig, or Stackdriver
- Programming & Automation: Proficiency in Python, Java, Go, or Bash for automation and tooling
- Containers & Orchestration: Experience with Kubernetes and container platforms such as Docker, Rancher, or Mesos
- Distributed Systems: Messaging and event‑driven platforms including Kafka, RabbitMQ, or ActiveMQ
- CI/CD & DevOps: Pipeline and deployment tools such as Jenkins, Harness, Travis CI, AWS CodeBuild/CodePipeline, or AppVeyor
- AI Enablement: Familiarity with using Large Language Models (LLMs) to automate SRE workflows (e.g., scripting, incident analysis, reporting)
- Resilience Engineering: Foundational exposure to Chaos Engineering and fault‑injection tools (e.g., Gremlin, Chaos Monkey, AWS FIS)
O…