Senior Site Reliability Engineer
Job in
San Francisco, San Francisco County, California, 94199, USA
Listed on 2026-02-16
Listing for:
Brahma Consulting Group
Full Time position
Job specializations:
IT/Tech
Systems Engineer, Cloud Computing, SRE/Site Reliability, Cybersecurity
Job Description & How to Apply Below
We are looking for an experienced Site Reliability Engineer to help scale a data- and ML-heavy platform with reliability, observability, and operational excellence at its core. You’ll work closely with software engineers and data scientists to design, automate, and operate the infrastructure that powers data pipelines, machine learning workloads, and real-time analytics systems. This is a hands-on, high-impact role with broad ownership across the stack and significant influence on how our platform and operations evolve.
Responsibilities
- Design, build, and maintain scalable infrastructure to support real-time analytics and ML workloads.
- Improve system reliability and performance through automation, observability, and proactive capacity and resilience planning.
- Own and evolve CI/CD pipelines, deployment automation, rollback mechanisms, and configuration management.
- Implement and maintain monitoring, alerting, and incident response processes (SLOs, runbooks, on-call).
- Collaborate closely with engineering and data science teams to drive a culture of reliability, performance, and operational excellence.
- Ensure security, compliance, and operational readiness across cloud and on-prem infrastructure.
- Lead post-incident reviews and drive continuous improvement initiatives.
- 8+ years of experience in SRE, DevOps, or infrastructure engineering roles.
- 5+ years of experience with datacenter operations and/or system and network administration.
- Strong experience with containerization (Docker) and orchestration (Kubernetes).
- Deep knowledge of Linux systems, networking, and systems performance tuning.
- Solid understanding of Infrastructure as Code (e.g., Terraform, Ansible) and config management.
- Strong scripting and coding skills, applying sound engineering principles to IaC and automation (Terraform, Ansible, Bash, Python).
- Experience with monitoring and observability stacks (e.g., Prometheus, Grafana, Datadog, ELK, OpenTelemetry).
- Proficiency with CI/CD tools and pipelines (e.g., GitHub Actions, ArgoCD, or similar).
- Proven ability to debug complex, distributed systems and automate robust solutions.
- Excellent communication skills and comfort working cross-functionally in fast-moving environments.
- Experience with NVIDIA DGX / POD architectures and related tooling (e.g., Base Command Manager, Mission Control, Run:AI).
- Experience with major cloud providers and managed services (e.g., AWS).
- Familiarity with security and compliance for cloud-native infrastructure (e.g., SOC 2 or similar environments).
- Experience at high-growth or top-tier tech companies (FAANG or VC-backed).
- Ownership of mission-critical infrastructure at a company solving real-world enterprise problems.
- A front-row seat in a high-performance engineering culture that values quality and velocity.
- The opportunity to shape how the platform scales—from deployment strategies to incident management practices.
- An environment that emphasizes curiosity, accountability, and meaningful impact.
Position Requirements
10+ years of work experience