Senior Site Reliability Engineer Job, El Segundo area, California, USA, IT/Tech

About Varda

Low Earth orbit is open for business. Varda is accelerating the development of commercial space infrastructure, from in-orbit pharmaceutical processing to reliable and economical reentry capsules.

From life-saving pharmaceuticals to more powerful fiber optics, there is a world of products used on Earth today that can only be manufactured in space. Varda is accelerating innovation in the orbital economy by creating both the products and infrastructure needed so space can directly benefit life on Earth. Our mission is to expand the economic bounds of humankind.

Our team is uniquely suited to accomplishing this goal, with leadership and staff composed of veterans from SpaceX, Blue Origin, major pharmaceutical companies, and Silicon Valley. Varda was founded in January 2021 by Will Bruey and Delian Asparouhov with significant backing from world-class investors including Khosla Ventures, Lux Capital, Founders Fund, Caffeinated Capital, General Catalyst, and Also Capital.

Varda is headquartered in El Segundo, California, with offices and a production facility where our vehicles, equipment, and materials are built, integrated, and tested. Varda also has offices in Washington, DC and Huntsville, AL.

Join Varda, and work to create a bustling in-space ecosystem.

About This Role

At Varda Space Industries, we're pushing the boundaries of what's possible in space and materials science - and we're looking for bold engineers to help us get there. As a Senior Site Reliability Engineer, you'll be critical in building, scaling, and maintaining the infrastructure that powers our systems on Earth, in orbit, and everything in between.

We are looking for an experienced engineer with deep working knowledge of Kubernetes and containerized technologies. You are a hands-on operator and builder who applies first-principles thinking to both software delivery (DevOps) and production reliability (SRE), and thrives in complex, mission-critical environments.

In this role, you will:

* Solve challenging technical problems across a wide range of modern technologies.

* Apply a software engineering mindset to automate operations and improve system reliability, scalability, and resilience.

* Design and build infrastructure that enables rapid development - from cloud-based services to embedded software running on spacecraft.

* Shape Varda's infrastructure strategy and drive operational excellence across containerized and modernized environments.

Responsibilities

* Deploy, maintain, and operate mission-critical applications and infrastructure supporting spacecraft and company-wide systems.

* Build and evolve Infrastructure as Code (IaC) frameworks using tools such as Terraform.

* Implement and operate observability systems (metrics, logging, tracing) and actionable alerting.

* Build and maintain CI/CD pipelines to enable safe, repeatable, and rapid deployments.

* Partner with software and hardware engineers to deliver highly operable, reliable, and scalable systems and pipelines, ensuring they have the tools and infrastructure needed for rapid iteration.

* Identify, analyze, and resolve system bottlenecks and reliability risks; perform performance tuning and implement long-term stability improvements.

* Respond to and resolve production incidents; perform root cause analysis and drive corrective actions through blameless postmortems.

* Rotate through the team's on-call schedule to keep critical systems healthy and responsive.

* Work extended hours and weekends as needed.

* Occasionally travel to customer sites and other Varda locations to troubleshoot, deploy, or test critical infrastructure.

Basic Qualifications

* Bachelor's degree in computer science, engineering, or related STEM field with 5+ years of Site Reliability Engineering experience, or 7+ years of progressive experience in DevOps, SRE, or Systems Engineering in lieu of a degree.

* Experience with Infrastructure as Code (IaC) using tools like Terraform to automate server provisioning and configuration management.

* Experience operating Kubernetes or similar container orchestration platforms in production environments.

* Experience with Prometheus, Grafana, InfluxDB, or similar technologies.

* Knowledge of software-defined networking (VPCs, subnets, firewalls, VPNs, etc.).

* Python, Bash, PowerShell (or similar) scripting experience.

* Strong, positive communication skills, both written and oral.

Preferred Skills and Experience

* Experience in provisioning and managing scalable Azure cloud infrastructure using native tools and best practices.

* Experience implementing configuration management, provisioning, and workflow automation solutions via Infrastructure as Code, CI/CD, and GitOps (e.g., Ansible, Salt, ArgoCD).

* Strong understanding of Linux systems and container runtimes (e.g., containerd, Docker).

* Experience with GPU workloads or high-throughput computing.

* Hands-on experience operating and optimizing High Performance Computing (HPC)…