Senior Manager of Cloud Platform & Site Reliability
Job in San Francisco, San Francisco County, California, 94199, USA
Listed on 2026-06-17
Listing for: Baseten
Full-time position
Job specializations:
- IT/Tech
Cloud Computing: Infrastructure & Operations, SRE/Site Reliability, Systems Engineer, IT Project Manager
Job Description & How to Apply Below
Requirements
- This role requires someone who can zoom out to set org-level direction while remaining technically credible enough to engage meaningfully in architectural decisions across Kubernetes, multi-cloud infrastructure, and reliability engineering
- Bachelor's, Master's, or Ph.D. degree in Computer Science, Engineering, Mathematics, or a related field
- Proven experience managing managers and leading multiple high-performing infrastructure, platform, or SRE teams in a fast-paced, high-growth environment
- Deep technical expertise in Kubernetes (multi-cloud across EKS, GKE, or similar), cloud infrastructure, and distributed systems, with the ability to engage credibly in architectural and operational decisions
- Hands-on background with infrastructure-as-code (e.g., Terraform, Pulumi) and CI/CD tooling (e.g., GitHub Actions, GitLab CI, Jenkins); familiarity with GitOps workflows (e.g., Flux CD, ArgoCD, Helm)
- Strong foundation in observability tooling, including metrics (Prometheus, VictoriaMetrics), logging (Loki, ELK), dashboards (Grafana), and tracing (OpenTelemetry), and a track record of raising reliability standards through SLOs, SLIs, and observability-as-code
- Experience owning incident management and enterprise SLAs at scale, including executive-level communication during high-severity incidents and rigorous post-incident follow-through
- Demonstrated ability to lead complex, multi-stakeholder technical initiatives from scoping through execution, balancing engineering excellence with pragmatic delivery
- Strong communication skills with executive presence, capable of representing technical work clearly to both technical and non-technical audiences
- No prior machine learning experience is required, but candidates should be open to learning about ML infrastructure and model serving
- (Desirable) Familiarity with running high-performance AI models and workloads, including troubleshooting ML pipelines from preprocessing through inference and serving
- (Desirable) Experience with GPU infrastructure, including fractional GPU provisioning and multi-node model serving (e.g., on H100s or B200s)
- (Desirable) Experience with incident management platforms (e.g., incident.io, PagerDuty) and building AI-assisted tooling for incident triage and response
- (Desirable) Experience scaling an SRE practice: defining runbook standards, building self-healing automations, and converting high-frequency failure patterns into systematic mitigations
About the Role
- As Senior Manager of Cloud Platform and Site Reliability, you will lead and grow the org responsible for the infrastructure that powers Baseten's machine learning platform
- This is a manager-of-managers role: you will lead team leads across our Cloud Platform and Site Reliability Engineering functions, setting the technical direction, defining reliability standards, and building the organizational muscle to scale our infrastructure alongside the product
- You will own the end-to-end health of our cloud infrastructure and SRE practice — from coaching your leads through complex incident response and enterprise customer escalations, to shaping the multi-year roadmap for multi-cloud capacity, GPU inference infrastructure, and observability platforms
- You operate at the intersection of people, strategy, and systems: you know how to build and develop strong teams, hold a high bar for engineering excellence, and make principled tradeoffs between long-term investment and short-term operational demands
Responsibilities
- Lead, grow, and develop team leads across the Cloud Platform and Site Reliability Engineering orgs, building a culture of ownership, technical excellence, and continuous improvement
- Set the technical direction and roadmap for infrastructure, reliability, and platform engineering at the org level — balancing near-term operational needs with long-term strategic investments
- Own the reliability posture of the platform end-to-end, establishing and enforcing org-wide standards for SLOs/SLIs, incident response, observability-as-code, runbooks, and post-incident reviews
- Drive cross-functional collaboration with product, engineering, and customer-facing teams to ensure infrastructure capabilities and reliability…
Position Requirements
10+ years of work experience