Principal Site Reliability Engineer – AI, New York, New York, USA (IT/Tech)

Location: New York

About Our Client

Our client is an AI-driven health-tech start-up on a mission to transform patient care through intelligent, secure, and highly reliable clinical automation tools. Their platform powers real-time insights for clinicians, improving patient outcomes and enabling healthcare systems to operate with unprecedented efficiency. They are entering a high-growth phase and are seeking a Principal Site Reliability Engineer to help scale their infrastructure and ensure world-class reliability.

Role Overview

Our client is hiring a Principal Site Reliability Engineer to serve as the technical authority for the reliability, scalability, and performance of their cloud-native infrastructure. This individual will design and implement systems that support rapid product development while meeting the resilience requirements of clinical-grade AI applications. The role blends hands‑on engineering with architectural leadership and cross‑functional collaboration across product, ML, infrastructure, and security teams.

What You’ll Do

Architect, build, and optimize scalable, secure, and highly available cloud infrastructure (AWS/Google Cloud Platform/Azure).
Lead incident response, root‑cause analysis, and production reliability improvements across the platform.
Implement observability frameworks (metrics, tracing, logging) that provide deep visibility into system performance.
Partner with ML and data engineering teams to operationalize AI/ML pipelines, ensuring reliability from data ingestion through model deployment.
Develop automated CI/CD pipelines, infrastructure‑as‑code, and guardrails for safer, faster deployments.
Define SLOs/SLIs and establish long‑term reliability roadmaps aligned with clinical‑grade requirements.
Mentor SREs and software engineers, promoting DevOps and reliability best practices across engineering.
Lead capacity planning, performance testing, and system hardening initiatives.
Collaborate with security teams to ensure compliance with HIPAA, SOC 2, and relevant privacy and security standards.
Evaluate new technologies and drive adoption of tools that improve operational excellence.

What They’re Looking For

8+ years in SRE, DevOps, Infrastructure Engineering, or related fields.
Deep expertise with Kubernetes, container orchestration, and microservices architecture.
Strong experience with cloud platforms (AWS/Google Cloud Platform/Azure) and infrastructure‑as‑code tools such as Terraform, Pulumi, or CloudFormation.
Advanced proficiency in automation/scripting languages such as Python, Go, or Bash.
Strong knowledge of distributed systems, reliability engineering patterns, and modern observability stacks (Prometheus, Grafana, OpenTelemetry, Datadog, etc.).
Experience supporting highly regulated or mission‑critical environments (healthcare, fintech, SaaS).
Hands‑on experience with ML infrastructure, model lifecycle management, or data pipelines is a plus.
Excellent communication skills and a proactive, ownership‑oriented mindset.

Why Candidates Love This Role

Mission‑driven work that directly influences patient care and health outcomes.
Ownership of foundational infrastructure in a rapidly scaling AI start‑up.
Competitive compensation, equity, and benefits.
A modern, cloud‑native tech stack with the ability to shape future architecture.
A collaborative and innovative engineering culture.


