Sr Site Reliability Engineer Job, Suffolk area, Virginia, USA, IT/Tech

About Commence

At Commence, we’re the start of a new age of data‑centric transformation, elevating health outcomes and powering more efficient processes for patients and programs. We combine quality, data‑driven solutions that fuel answers, technology that advances performance, and clinical expertise that builds trust to create a more efficient path to quality care.

With human‑centered, healthcare‑relevant, and value‑based solutions, we create new possibilities with data. We provide proof beyond the concept and performance beyond the scope with a focus on efficiencies that transform the lives of those we serve. With a culture driven by purpose, straightforward communication and clinical domain expertise, Commence cuts straight to better care.

Responsibilities

Design, implement, and own observability infrastructure including metrics, logging, tracing, and alerting across distributed systems.
Define and enforce SLOs, SLIs, and error budgets in partnership with product and engineering teams.
Lead incident response: triage, coordinate remediation, conduct blameless post‑mortems, and drive systemic fixes.
Build and maintain CI/CD pipelines that support rapid, safe delivery of changes to production.
Collaborate with engineering teams on infrastructure changes; read, modify, and contribute to existing infrastructure‑as‑code (Terraform or CloudFormation).
Design and operate highly available, fault‑tolerant systems—including auto‑scaling, failover, and disaster recovery strategies.
Reduce operational toil through automation; eliminate manual processes before they become habits.
Collaborate with software engineers to establish reliability‑first design patterns and review architectures for operational risk.
Manage Kubernetes or container orchestration environments at scale.
Ensure systems meet compliance and security requirements, particularly those applicable to healthcare data (HIPAA, SOC 2).
Provide technical mentorship and guidance to engineers across the organization on reliability practices.
Participate in on‑call rotation with a commitment to continuously reducing the need for it.

Qualifications

7+ years of experience in SRE, platform engineering, or DevOps roles.
Exceptional problem‑solving under pressure—demonstrated track record of diagnosing complex, high‑stakes system failures and building durable solutions.
Deep hands‑on experience with AWS services including EC2, EKS/ECS, Lambda, RDS, S3, CloudWatch, and related tooling.
Familiarity with infrastructure‑as‑code (Terraform or CloudFormation), with the ability to contribute to existing configurations.
Experience designing and operating distributed systems with strict availability and latency requirements.
Proficiency in at least one scripting or systems language (Python, Go, Bash, or similar) for automation and tooling.
Experience with container orchestration (Kubernetes, ECS) in production environments.
Expertise in observability tooling (OpenSearch, Prometheus/Grafana, or equivalent).
Hands‑on experience with CI/CD platforms (GitHub Actions, Jenkins, CircleCI, or similar).
Proven ability to define and operationalize SLOs and error budgets.
Experience with relational and NoSQL databases: performance tuning, replication, and backup strategies.
Strong working knowledge of networking fundamentals: DNS, load balancing, VPCs, TLS.
Excellent communication skills—able to translate technical risk into business impact for non‑engineering stakeholders.

Additional Requirements

AWS Certifications (Solutions Architect, DevOps Engineer, or SysOps Administrator).
Experience in healthcare technology or other regulated industries (HIPAA, SOC 2, FedRAMP).
Familiarity with chaos engineering practices and tooling.
Experience with data pipeline reliability (ETL/ELT workflows, streaming systems).
Exposure to AI/ML infrastructure and the reliability challenges unique to model serving.
Familiarity with additional cloud platforms (Azure, Google Cloud).
Contributions to open‑source reliability or infrastructure tooling.

Work Environment / Physical Demands

The work environment and physical demands described here are representative of those that must be met by an employee to successfully perform the…