Site Reliability Engineer (Space Communications)
Listed on 2025-12-02
IT/Tech
Systems Engineer, Cloud Computing, SRE/Site Reliability, IT Support
Overview
We are seeking a Site Reliability Engineer (Space Communications) to help build and maintain observability infrastructure and ensure the global space communications network operates reliably as we scale ground stations around the world.
Responsibilities
- Build and maintain the observability stack with tools such as Grafana, Prometheus, Loki, Vector, CloudWatch, and VictoriaMetrics for metrics and log ingestion across environments
- Support and improve CI/CD pipelines using GitLab and ArgoCD, collaborating with development teams on deployment best practices
- Help build and maintain cloud infrastructure using Terraform on AWS, contributing to the scalability and reliability of space communication systems
- Work with senior engineers to establish monitoring strategies, alerting, and incident response procedures
- Deploy and manage Kubernetes applications using Helm charts, focusing on reliability and developer experience
- Collaborate with engineering teams to implement performance monitoring and troubleshooting across microservices
- Support identity and access management integration with Okta and HashiCorp Vault
- Assist in managing NixOS-based infrastructure for reproducible system configurations
- Participate in incident response efforts and contribute to post-incident reviews and improvements
Qualifications
- 2-4 years of hands-on experience with infrastructure tools and monitoring systems in production environments
- Experience with containerization (Docker, Kubernetes) and basic container orchestration
- Familiarity with CI/CD tools (GitLab, Jenkins, or similar) and infrastructure-as-code concepts
- Experience with cloud platforms (AWS preferred) and basic infrastructure automation
- Programming skills in Python or a similar language, and experience with configuration management
- Startup mentality, with the ability to work in fast-paced, high-growth environments and take on diverse responsibilities
- Experience with logging and metrics collection for production systems
- Understanding of system reliability principles and interest in learning SRE practices
- Some exposure to observability tools like Vector, Loki, Grafana, Prometheus, or similar monitoring systems
- Experience with Terraform or other infrastructure-as-code tools
- Familiarity with NixOS or other declarative system configuration approaches
- Basic knowledge of HashiCorp Vault, Okta, or similar identity/secrets management tools
- Interest in distributed systems and troubleshooting complex technical issues
- Previous startup experience or demonstrated ability to learn quickly and adapt
- Linux system administration experience
- AWS certification or demonstrated cloud platform knowledge
To conform to U.S. Government space technology export regulations, including the International Traffic in Arms Regulations (ITAR), you must be a U.S. citizen, a lawful permanent resident of the U.S., a protected individual as defined by 8 U.S.C. 1324b(a)(3), or eligible to obtain the required authorizations from the U.S. Department of State.
Northwood is an Equal Opportunity Employer; employment with Northwood is governed on the basis of merit, competence and qualifications and will not be influenced in any manner by race, color, religion, gender, national origin/ethnicity, veteran status, disability status, age, sexual orientation, gender identity, marital status, mental or physical disability or any other legally protected status.