Lead Site Reliability Engineer
Listed on 2026-05-31
IT/Tech
Cloud Computing, Systems Engineer, SRE/Site Reliability
Join us as we work to create a thriving ecosystem that delivers accessible, high-quality, and sustainable healthcare for all.
About Our Company
At athenahealth, we deliver high-quality and affordable healthcare solutions and drive growth across industries. Our success is powered by our talented team and the strategic leadership of our corporate managers. We foster a collaborative and dynamic environment where employees are encouraged to innovate, grow, and excel in their careers. As part of our team, you’ll be empowered to make a significant impact, lead strategic initiatives, and drive business results.
Position Overview
We are looking for a Lead Site Reliability Engineer to join our Cloud Engineering division. Cloud Engineering ensures the continuous availability of the technologies and systems that are the foundation of athenahealth’s services. We are directly responsible for thousands of servers, petabytes of storage, and handling thousands of web requests per second, all while sustaining growth at a meteoric rate. We enable an operating system for the medical office that abstracts away administrative complexity, leaving doctors free to practice medicine.
But enough about us; let’s talk about you!
You’re a seasoned engineer with a passion for identifying and resolving reliability and scalability challenges. You are a curious team player, someone who loves to explore, learn, and make things better. You are excited to uncover inefficiencies in business processes, creative in finding ways to automate solutions, and relentless in your pursuit of greatness. You’re a nimble learner capable of quickly absorbing complex solutions and an excellent communicator who can help evangelize engineering excellence.
The Team
We are a team of Site Reliability Engineers who are passionate about reliability, automation, and scalability. We use an agile-based framework to execute our work, ensuring we are always focused on the most important and impactful needs of the business. We support systems in both private and public cloud and make data-driven decisions about which one best suits the needs of the business.
We are relentless in automating away manual, repetitive work so we can focus on projects that help move the business forward.
- Define, measure, and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for cloud services and infrastructure components.
- Lead efforts to continuously improve system availability, fault tolerance, and disaster recovery capabilities.
- Ensure proactive incident detection, efficient root cause analysis, and timely resolution of production incidents.
- Participate in a 12x7 on-call rotation. We have a peer team in India that manages the overnight on-call.
- Drive automation efforts to reduce manual intervention and streamline cloud infrastructure management.
- Implement Infrastructure as Code (IaC) using tools like Terraform, AWS CloudFormation, and Ansible to provision, manage, and scale cloud resources.
- Automate deployment, scaling, and monitoring processes to improve efficiency and reduce operational complexity.
- Design and implement monitoring, logging, and alerting solutions to track cloud infrastructure health, performance, and security.
- Use observability tools (e.g., Prometheus, Grafana, CloudWatch) to ensure continuous visibility into cloud infrastructure performance and capacity.
- Identify bottlenecks and performance issues, proposing and implementing improvements to ensure optimal resource usage.
- Ensure that cloud infrastructure is built with security best practices in mind and meets all relevant compliance and regulatory requirements.
- Collaborate with security teams to implement security controls and risk mitigation strategies across cloud environments.
- Regularly audit and review cloud infrastructure for security vulnerabilities and compliance gaps.
- Work closely with development, DevOps, and operations teams to ensure cloud infrastructure aligns with application and…