Site Reliability Engineer Job, San Francisco area, California, USA, IT/Tech

Requirements

Cloud Operations:
4+ years of experience managing production-grade environments in AWS, GCP, or Azure
Orchestration:
Expert-level proficiency with Kubernetes (EKS), including networking, ingress controllers, and service mesh management
Automation:
Strong experience with configuration management and IaC (e.g., Terraform, Ansible, Helm)
Data Systems:
Deep knowledge of SQL and NoSQL database administration, focusing on replication, backup, and disaster recovery
Programming:
Proficiency in Python and C++ for developing internal tooling and automating complex operational workflows
Systems Internals:
Strong understanding of Linux networking, storage, and kernel tuning
(Desirable) Prior experience in Aerospace, Defense, or high-reliability sectors
(Desirable) Familiarity with CCSDS standards or satellite ground station software
(Desirable) Experience with secure, air-gapped, or hybrid-cloud deployments

What the job involves

We are seeking a Site Reliability Engineer (SRE) to architect and manage the critical ground infrastructure for our satellite constellation. This role is responsible for the "last mile" of mission success: ensuring that the software controlling our orbital assets is highly available, scalable, and seamlessly integrated with Mission Operations.
You will own the lifecycle of our production environments, from automating deployments via Infrastructure as Code (IaC) to managing the core data systems that track constellation health and user activity.
Infrastructure as Code (IaC):
Design and maintain scalable, repeatable cloud infrastructure (AWS) using tools like Terraform or CloudFormation
Mission Ops Integration:
Build and optimize the interfaces between core data management systems and Mission Operations software, ensuring reliable telemetry and command flows
User & Data Management:
Architect and maintain high-availability identity providers (IdP) and distributed databases to support global user access and real-time data processing
Automated Deployment Pipelines:
Create and manage robust CI/CD pipelines to deploy containerized applications into production with a focus on zero-downtime releases and rollback capabilities
Observability & Reliability:
Implement comprehensive monitoring, alerting, and logging (e.g., Prometheus, Grafana, ELK) to ensure 99.99% uptime for ground segment services
Scalability Engineering:
Perform capacity planning and performance tuning to handle the high-throughput data requirements of a growing satellite constellation
