Senior Site Reliability Engineer Job, San Francisco area, California, USA (IT/Tech)

We are looking for an experienced Site Reliability Engineer to help scale a data- and ML-heavy platform with reliability, observability, and operational excellence at its core. You’ll work closely with software engineers and data scientists to design, automate, and operate the infrastructure that powers data pipelines, machine learning workloads, and real-time analytics systems. This is a hands-on, high-impact role with broad ownership across the stack and significant influence on how our platform and operations evolve.

Responsibilities

Design, build, and maintain scalable infrastructure to support real-time analytics and ML workloads.
Improve system reliability and performance through automation, observability, and proactive capacity and resilience planning.
Own and evolve CI/CD pipelines, deployment automation, rollback mechanisms, and configuration management.
Implement and maintain monitoring, alerting, and incident response processes (SLOs, runbooks, on-call).
Collaborate closely with engineering and data science teams to drive a culture of reliability, performance, and operational excellence.
Ensure security, compliance, and operational readiness across cloud and on-prem infrastructure.
Lead post-incident reviews and drive continuous improvement initiatives.

Required Qualifications

8+ years of experience in SRE, DevOps, or infrastructure engineering roles.
5+ years of experience with datacenter operations and/or system and network administration.
Strong experience with containerization (Docker) and orchestration (Kubernetes).
Deep knowledge of Linux systems, networking, and systems performance tuning.
Solid understanding of Infrastructure as Code (e.g., Terraform, Ansible) and config management.
Strong scripting and coding skills, applying sound engineering principles to IaC and automation (Terraform, Ansible, Bash, Python).
Experience with monitoring and observability stacks (e.g., Prometheus, Grafana, Datadog, ELK, OpenTelemetry).
Proficiency with CI/CD tools and pipelines (e.g., GitHub Actions, ArgoCD, or similar).
Proven ability to debug complex, distributed systems and automate robust solutions.
Excellent communication skills and comfort working cross-functionally in fast-moving environments.

Preferred Qualifications

Experience with NVIDIA DGX / POD architectures and related tooling (e.g., Base Command Manager, Mission Control, Run:AI).
Experience with major cloud providers and managed services (e.g., AWS).
Familiarity with security and compliance for cloud-native infrastructure (e.g., SOC 2 or similar environments).
Experience at high-growth or top-tier tech companies (FAANG or VC-backed).

What You’ll Get

Ownership of mission-critical infrastructure at a company solving real-world enterprise problems.
A front-row seat in a high-performance engineering culture that values quality and velocity.
The opportunity to shape how the platform scales, from deployment strategies to incident management practices.
An environment that emphasizes curiosity, accountability, and meaningful impact.
