
Site Reliability Engineer

Job in San Francisco, San Francisco County, California, 94199, USA
Listing for: United States Digital Space LLC
Full Time position
Listed on 2026-06-28
Job specializations:
  • IT/Tech
    Cloud Computing: Infrastructure & Operations, SRE/Site Reliability, Systems Engineer
Salary/Wage Range or Industry Benchmark: 120,000 - 150,000 USD yearly
Job Description & How to Apply Below

the company | Site Reliability Engineer | San Francisco, CA (Hybrid) | Full-time

the company is a no-code data workflow automation tool that helps operations teams move, transform, and automate their data without writing code. LLMs are a core part of our product — we use them to help users build and reason about their workflows — and they're increasingly part of how we run infrastructure too. We're a small, product-focused team and our infrastructure runs on AWS.

We're looking for an SRE who's passionate about observability and about keeping systems healthy and understandable. You'll own our monitoring and alerting infrastructure, drive incident response, and work closely with engineering to make sure we have deep visibility into everything that matters. We expect you to use LLMs heavily in your work (writing runbooks, generating alert configs, drafting postmortems, building dashboards), and we want someone who's already figured out how to make that feel natural.

Responsibilities
  • Observability stack — Prometheus, Grafana, dashboards, alerting, and on-call workflows
  • Incident response and postmortems — building a culture of learning from failures
  • SLIs, SLOs, and error budgets — helping the team make data-driven reliability decisions
  • Monitoring LLM-specific infrastructure: latency, token throughput, model error rates, cost attribution
  • AWS infrastructure across our stack (Lambda, ECS, RDS, OpenSearch, CloudFront, etc.)
  • CDK-based IaC and CI/CD pipelines as needed
Qualifications
  • Hands-on experience with Prometheus and Grafana (or similar — Datadog, Honeycomb, etc.)
  • Strong instincts for what to instrument and what good alerting actually looks like
  • Comfort debugging distributed systems across the full stack
  • Experience owning on-call and incident response end to end
  • AWS familiarity and enough IaC experience to get things done (CDK or Terraform)
  • Someone who reaches for an LLM before writing boilerplate from scratch — and knows when not to
Nice to have
  • Experience instrumenting LLM pipelines
  • Familiarity with TypeScript/Node.js
  • Startup experience
  • Background in security and compliance