Senior Site Reliability Engineer
Listed on 2025-12-01
IT/Tech
Cloud Computing, Systems Engineer, IT Support, SRE/Site Reliability
Who We Are
QGenda is redefining healthcare workforce management everywhere care is delivered. We're on a mission to empower the healthcare industry to better onboard, deploy, and manage its workforce. Over 4,500 healthcare organizations have trusted us to help them make strategic workforce decisions through our unified software platform. With more than 700 employees across the US, we are united in our vision and culture to make a difference for our customers, while enjoying the day‑to‑day.
At QGenda, we value our employees and their contributions toward the success of the business. We strive to create a dynamic work environment that fosters growth, innovation, and collaboration, where employees can be proud of the work they do and the impact it has on the healthcare industry. QGenda is headquartered in Atlanta.
To learn more, visit our website or follow us on Instagram or LinkedIn.
About Your Role
As a Senior Site Reliability Engineer, you will work with our Infrastructure and Product Development Teams to design, operate, and scale highly available services on AWS. You’ll lead automation and infrastructure‑as‑code efforts to eliminate toil, standardize configuration, and expand observability across metrics, logs, and traces. You will evaluate and introduce AWS services and tooling that improve reliability, performance, and developer velocity.
This role offers the opportunity to shape our reliability roadmap and make a measurable impact on the resilience and evolution of our technology stack.
- Design, implement, and manage scalable systems that ensure high availability, fault tolerance, and optimal performance.
- Continuously monitor and enhance system health and performance through data analysis and metrics.
- Embed observability (metrics, logs, traces, alerts) with actionable thresholds and up‑to‑date runbooks.
- Eliminate toil by building automation and self‑service tools for common operational workflows.
- Own CI/CD pipelines (build, test, security scans) and enable progressive delivery (blue/green, canary).
- Manage infrastructure as code via Terraform and configuration management with Git‑backed workflows.
- Participate in on‑call; triage, mitigate, and resolve incidents within defined SLAs.
- Lead incident response and blameless post‑incident reviews; document RCAs and drive corrective actions to closure.
- Maintain runbooks/playbooks and regularly rehearse disaster recovery scenarios.
- Operate and secure AWS environments (IAM, VPC, EC2/ECS, RDS, S3, Lambda, etc.) with a focus on resilience and compliance.
- Optimize cost, performance, and reliability (rightsizing, autoscaling, reservations/savings plans, tagging, spend monitoring, etc.).
- Serve as a technical advisor to engineering teams on infrastructure and operations best practices.
- Mentor peers on SRE practices; promote observability, continuous improvement, and a blameless culture.
- Contribute to roadmaps and capacity planning to align reliability goals with product objectives.
- Availability for off‑hours deployment and upgrades of production systems during release and maintenance windows.
- Strong problem‑solving skills and ability to work effectively under pressure.
- Excellent communication skills for cross‑functional collaboration as well as documentation creation.
- B.S. in Computer Science, Computer Information Systems, or Computer Engineering from a major U.S. university or equivalent industry experience.
- 7+ years of experience as a DevOps, SRE, or Systems Engineer.
- Advanced proficiency with at least one scripting or programming language.
- Experience with Docker and container orchestration tools such as AWS ECS.
- Hands‑on experience building infrastructure and supporting applications in AWS using services such as Lambda, EC2, ECS, S3, SNS, SQS, RDS, Redshift, and ElastiCache.
- Experience with logging and with creating dashboards and alerts using observability tools such as Datadog and Amazon CloudWatch.
- Strong understanding of…