Senior Kubernetes Engineer - Scientific & Agentic Workflow Platforms
Listed on 2026-06-03
Software Development
AI Engineer, Data Scientist
About The Role
Do you want your Kubernetes clusters to do more than serve web traffic? At SLAC, our infrastructure powers the discovery of new materials, the mapping of the universe, and the understanding of fundamental physics.
The Application and User Services (AUS) group within the Scientific Computing Services Division manages the platforms that underpin science. We build and operate the systems that let researchers focus on discovery rather than infrastructure. We are now seeking a Senior Kubernetes Engineer to help design and implement a scalable, next‑generation platform purpose‑built for scientific and agentic workflows.
This role is not just about managing pods and nodes—it is about building the computational engines that allow scientists to peer into atomic structure, catalog billions of galaxies, and increasingly, to deploy intelligent autonomous agents that drive the next generation of experimental science. You will stand at the intersection of cloud‑native engineering and Nobel‑prize caliber research, collaborating within SLAC and across the broader Department of Energy (DOE) complex, Stanford University, and partner institutions worldwide.
Scientific experiments like the Vera C. Rubin Observatory and LCLS generate data at rates that challenge the limits of modern infrastructure. AI‑driven agentic workflows (pipelines where autonomous agents orchestrate complex, multi‑step scientific analyses) are rapidly becoming a core part of how experiments are designed, run, and interpreted. You will help us build and maintain the platform that makes all of this possible.
- Design, build, and operate highly available Kubernetes‑based platforms optimized for scientific and agentic workloads
- Architect scalable solutions for high‑throughput data pipelines, real‑time streaming, and batch scientific computing
- Design and implement platform primitives for agentic workflow orchestration—enabling autonomous, multi‑step AI‑driven pipelines that support experimental science
- Develop cloud‑native architectures supporting on‑premises, hybrid cloud, and multi‑cluster deployments
- Build and maintain Infrastructure‑as‑Code using tools such as Helm, Kustomize, and GitOps workflows
- Evaluate and introduce new technologies and patterns that advance the platform's capabilities for the scientific community
- Lead platform design for agentic scientific workflows—systems where AI agents autonomously orchestrate data acquisition, analysis, and experimental feedback loops
- Collaborate with researchers and data scientists to define platform requirements for running large language model‑driven and reinforcement learning agents at scale
- Implement infrastructure patterns for agent orchestration frameworks (e.g., multi‑agent pipelines, tool‑use APIs, memory and state management) within Kubernetes
- Ensure the platform supports the latency, throughput, and accelerator requirements of agentic workloads
- Build guardrails, observability, and governance tooling suited to autonomous scientific agents operating on sensitive experimental data
- Partner with scientists and researchers—at SLAC and across DOE labs and universities—to design and implement solutions for major scientific programs, including:
  - Vera C. Rubin Observatory / LSST: Petabyte‑scale nightly sky surveys requiring real‑time alert pipelines and long‑running batch analysis for dark matter and dark energy research
  - LCLS (Linac Coherent Light Source): Real‑time analysis infrastructure for the world's brightest X‑ray laser, capturing femtosecond‑scale dynamics of matter
  - Cryo‑EM: High‑throughput 3D reconstruction pipelines for structural biology at near‑atomic resolution
  - Accelerator Operations: Monitoring, control, and data acquisition infrastructure for particle accelerators
  - American Science Cloud: National‑scale scientific data infrastructure to democratize access to computing resources across National Laboratories
  - Emerging Initiatives: Co‑design of infrastructure for next‑generation scientific computing programs not yet fully defined
- Support the full project lifecycle, from initial technical…