
Senior Site Reliability Engineer

Job in Menlo Park, San Mateo County, California, 94025, USA
Listing for: SLAC National Accelerator Laboratory
Full Time position
Listed on 2026-05-22
Job specializations:
  • IT/Tech
    Systems Engineer, Data Engineer, Cloud Computing

Job: 6630
Location: SLAC - Menlo Park, CA
Full-Time, Regular

**SLAC Job Postings**

Join the Data Management (DM) team at the **Vera C. Rubin Observatory**, one of modern astronomy's defining missions. The Rubin Observatory is a new astronomy facility in Chile designed to create a 10-year time-lapse map of the southern sky through the Legacy Survey of Space and Time (LSST).

As part of this team, you'll design, operate, and sustain the systems that process Rubin's data in near real time. LSST will generate 15 TB of raw pixels per night with its 8-meter mirror and 3.2 gigapixel camera, creating one of the most demanding **petascale data challenges** in science.

The Data Management System's Prompt Processing Framework identifies and distributes alerts for every astrophysical object that moves, changes, or appears in the sky within minutes of observation. These alerts include potentially hazardous asteroids, supernovae, and entirely new classes of transient phenomena. Your work will directly **enable astrophysical discoveries** by keeping Rubin's alerts flowing.

You will join a distributed team of roughly 80 scientists and engineers building and operating Rubin's petascale data management systems. Our work spans large-scale image processing, distributed databases, and production services. Python is our lingua franca, and we develop our software openly on GitHub under an open-source license.

**Your role:**

You will own the reliability and robustness of Rubin Observatory's Prompt Processing Framework, the system responsible for detecting and distributing near-real-time alerts for transient and moving objects in the night sky. The Prompt Processing Framework runs on Kubernetes, with event-driven scaling using Kubernetes Event-Driven Autoscaling (KEDA) integrated with Redis Streams. It interfaces with PostgreSQL databases and Kafka to ingest data and publish alerts to the global astronomy community.

**Your responsibilities:**

+ Ensure, through both architecture and practice, the reliable operation of the near-real-time data processing pipeline and timely delivery of alerts to downstream brokers.

+ Design and develop software that reduces operational risk and improves system resilience, scalability, and usability, including addressing failure modes, error handling, and contention in shared resources.

+ Improve system performance and resilience by applying architectural and systems-level optimizations to increase throughput and reduce end-to-end latency.

+ Operate DevOps-oriented continuous deployment of services using modern distributed systems tooling and development practices (e.g., Kubernetes, Helm, ArgoCD, Kafka, Redis).

+ Develop monitoring dashboards and alerts for the prompt processing service and work with teammates to design and implement a sustainable on-call rotation that provides coverage during the start of observing hours in Chile (typically 2-5pm Pacific Time), with limited off-hours responsibility.

+ Define KPIs and metrics for observability and accountability of the pipeline.

+ Participate in the collective engineering activities of the team, including performing code reviews, acting as a troubleshooting buddy, participating in design discussions, and writing documentation to effectively capture and communicate architectural and implementation choices.

+ Collaborate with members of the Data Management team to identify opportunities to improve tools, workflows, and operational practices.

+ Share responsibility with the broader team for the overall success of the Data Management system, beyond the Prompt Processing Framework.

**Tech Stack**

The Prompt Processing Framework is built on a modern, cloud-native foundation. It runs on Kubernetes, with deployments managed via Helm and ArgoCD, and uses event-driven scaling through KEDA and Redis Streams. The system integrates with PostgreSQL and Kafka to ingest data and distribute alerts, with additional databases including Cassandra and InfluxDB. Our primary development language is Python, and our code is developed openly under an open-source model.

**To be successful in this position you will bring:**

+ Bachelor's…
Position Requirements
10+ years of work experience