Senior Manager of Cloud Platform & Site Reliability Job San Francisco area, California USA, IT/Tech

Requirements

This role requires someone who can zoom out to set org-level direction while remaining technically credible enough to engage meaningfully in architectural decisions across Kubernetes, multi-cloud infrastructure, and reliability engineering
Bachelor's, Master's, or Ph.D. degree in Computer Science, Engineering, Mathematics, or a related field
Proven experience managing managers and leading multiple high-performing infrastructure, platform, or SRE teams in a fast-paced, high-growth environment
Deep technical expertise in Kubernetes (multi-cloud across EKS, GKE, or similar), cloud infrastructure, and distributed systems, with the ability to engage credibly in architectural and operational decisions
Hands-on background with infrastructure-as-code (e.g., Terraform, Pulumi) and CI/CD tooling (e.g., GitHub Actions, GitLab CI, Jenkins); familiarity with GitOps workflows (e.g., FluxCD, ArgoCD, Helm)
Strong foundation in observability tooling, including metrics (Prometheus, VictoriaMetrics), logging (Loki, ELK), dashboards (Grafana), and tracing (OpenTelemetry), and a track record of raising reliability standards through SLOs, SLIs, and observability-as-code
Experience owning incident management and enterprise SLAs at scale, including executive-level communication during high-severity incidents and rigorous post-incident follow-through
Demonstrated ability to lead complex, multi-stakeholder technical initiatives from scoping through execution, balancing engineering excellence with pragmatic delivery
Strong communication skills with executive presence, capable of representing technical work clearly to both technical and non-technical audiences
No prior machine learning experience is required, but candidates should be open to learning about ML infrastructure and model serving
(Desirable) Familiarity with running high-performance AI models and workloads, including troubleshooting ML pipelines from preprocessing through inference and serving
(Desirable) Experience with GPU infrastructure, including fractional GPU provisioning and multi-node model serving (e.g., on H100s or B200s)
(Desirable) Experience with incident management platforms (e.g., incident.io, PagerDuty) and building AI-assisted tooling for incident triage and response
(Desirable) Experience scaling an SRE practice: defining runbook standards, building self-healing automations, and converting high-frequency failure patterns into systematic mitigations

What the job involves

As Senior Manager of Cloud Platform and Site Reliability, you will lead and grow the org responsible for the infrastructure that powers Baseten's machine learning platform
This is a manager-of-managers role: you will lead team leads across our Cloud Platform and Site Reliability Engineering functions, setting the technical direction, defining reliability standards, and building the organizational muscle to scale our infrastructure alongside the product
You will own the end-to-end health of our cloud infrastructure and SRE practice — from coaching your leads through complex incident response and enterprise customer escalations, to shaping the multi-year roadmap for multi-cloud capacity, GPU inference infrastructure, and observability platforms
You operate at the intersection of people, strategy, and systems: you know how to build and develop strong teams, hold a high bar for engineering excellence, and make principled tradeoffs between long-term investment and short-term operational demands
Lead, grow, and develop team leads across the Cloud Platform and Site Reliability Engineering orgs, building a culture of ownership, technical excellence, and continuous improvement
Set the technical direction and roadmap for infrastructure, reliability, and platform engineering at the org level — balancing near-term operational needs with long-term strategic investments
Own the reliability posture of the platform end-to-end, establishing and enforcing org-wide standards for SLOs/SLIs, incident response, observability-as-code, runbooks, and post-incident reviews
Drive cross-functional collaboration with product, engineering, and customer-facing teams to ensure infrastructure capabilities and reliability…