Site Reliability Engineer

Job in San Francisco, San Francisco County, California, 94199, USA
Listing for: Valid8 Financial, Inc.
Full Time position
Listed on 2026-06-28
Job specializations:
  • IT/Tech
    SRE/Site Reliability, Cloud Computing: Infrastructure & Operations
Salary/Wage Range or Industry Benchmark: 130000 - 160000 USD Yearly
Job Description & How to Apply Below
Position: Staff Site Reliability Engineer

You will define how mission-critical machine learning and real-time analytics systems operate in production — influencing reliability strategy, deployment standards, and infrastructure architecture across engineering.

This team operates in a highly collaborative, in-person engineering environment in SOMA. Infrastructure, ML, and engineering leaders work side by side to design, build, and operate complex systems in real time. The pace is fast, the feedback loops are tight, and decisions happen quickly.

If you’ve grown from Linux systems to DevOps to Staff-level SRE, and you now think in terms of systemic risk, scalability, and long-term reliability strategy, this role gives you direct influence and visibility.

This role is intentionally in-person because:

Reliability decisions happen at architectural depth — not over Slack threads

ML, data, and infrastructure teams collaborate continuously in real time

Post-incident reviews, system design debates, and performance tuning sessions are hands‑on and high impact

You will have direct access to engineering leadership and decision‑makers

The infrastructure you’re operating is mission-critical and evolving quickly

If you value deep technical collaboration, tight feedback loops, and being at the center of high-scale ML systems — this environment is built for that.

What You’ll Own

Production reliability for ML and real-time analytics workloads

CI/CD strategy, deployment automation, and rollback design

Observability frameworks (SLOs, alerting, monitoring, incident response)

Infrastructure-as-Code and Kubernetes environments

Capacity planning and performance optimization

Post‑incident reviews that drive measurable, long‑term reliability improvements

Reliability standards across teams — not just within a single service

You’ll partner directly with engineering and data science teams to ensure ML workloads are production‑ready and reliable by design.

What We’re Looking For

Deep experience operating Linux infrastructure and networking in production environments

Proven impact as a Staff SRE, Senior SRE, or senior‑level DevOps/Platform Engineer supporting distributed systems

Experience supporting complex, data‑intensive or ML‑driven systems in production

Strong hands‑on experience with Docker and Kubernetes

Strong scripting ability (Bash and/or Python)

CI/CD ownership experience (GitHub Actions, ArgoCD, or similar)

Experience with modern observability stacks (Prometheus, Grafana, Datadog, ELK, OpenTelemetry)

Ability to debug systemic failures across infrastructure, deployments, and workloads

Clear communicator who works effectively across engineering and data teams

Engineers who have evolved from infrastructure foundations into strategic reliability leaders will thrive here.

These Skills Are a Plus

Experience operating ML platforms at scale (training + inference)

AWS or cloud‑managed services experience

Exposure to data platforms such as Spark, Airflow, or Kafka

Experience in SOC 2 or regulated environments

Why This Opportunity

Staff‑level ownership of mission‑critical ML infrastructure

Direct influence over reliability standards across engineering

High‑visibility role with architectural impact

Collaborative engineering culture designed for speed and depth
