Principal Site Reliability Engineer – AI
Listed on 2026-01-01
IT/Tech
Cloud Computing, Systems Engineer
About Our Client
Our client is an AI-driven health-tech start-up on a mission to transform patient care through intelligent, secure, and highly reliable clinical automation tools. Their platform powers real-time insights for clinicians, improving patient outcomes and enabling healthcare systems to operate with unprecedented efficiency. They are entering a high-growth phase and are seeking a Principal Site Reliability Engineer to help scale their infrastructure and ensure world-class reliability.
Role Overview
Our client is hiring a Principal Site Reliability Engineer to serve as the technical authority for the reliability, scalability, and performance of their cloud-native infrastructure. This individual will design and implement systems that support rapid product development while meeting the resilience requirements of clinical-grade AI applications. The role blends hands‑on engineering with architectural leadership and cross‑functional collaboration across product, ML, infrastructure, and security teams.
What You’ll Do
- Architect, build, and optimize scalable, secure, and highly available cloud infrastructure (AWS/Google Cloud Platform/Azure).
- Lead incident response, root‑cause analysis, and production reliability improvements across the platform.
- Implement observability frameworks (metrics, tracing, logging) that provide deep visibility into system performance.
- Partner with ML and data engineering teams to operationalize AI/ML pipelines, ensuring reliability from data ingestion through model deployment.
- Develop automated CI/CD pipelines, infrastructure‑as‑code, and guardrails for safer, faster deployments.
- Define SLOs/SLIs and establish long‑term reliability roadmaps aligned with clinical‑grade requirements.
- Mentor SREs and software engineers, promoting DevOps and reliability best practices across engineering.
- Lead capacity planning, performance testing, and system hardening initiatives.
- Collaborate with security teams to ensure compliance with HIPAA, SOC 2, and relevant privacy and security standards.
- Evaluate new technologies and drive adoption of tools that improve operational excellence.
Qualifications
- 8+ years of experience in SRE, DevOps, infrastructure engineering, or related fields.
- Deep expertise with Kubernetes, container orchestration, and microservices architecture.
- Strong experience with cloud platforms (AWS/Google Cloud Platform/Azure) and infrastructure‑as‑code tools such as Terraform, Pulumi, or CloudFormation.
- Advanced proficiency in automation/scripting languages such as Python, Go, or Bash.
- Strong knowledge of distributed systems, reliability engineering patterns, and modern observability stacks (Prometheus, Grafana, OpenTelemetry, Datadog, etc.).
- Experience supporting highly regulated or mission‑critical environments (healthcare, fintech, SaaS).
- Hands‑on experience with ML infrastructure, model lifecycle management, or data pipelines is a plus.
- Excellent communication skills and a proactive, ownership‑oriented mindset.
What’s on Offer
- Mission‑driven work that directly influences patient care and health outcomes.
- Ownership of foundational infrastructure in a rapidly scaling AI start‑up.
- Competitive compensation, equity, and benefits.
- A modern, cloud‑native tech stack with the ability to shape future architecture.
- A collaborative and innovative engineering culture.