Site Reliability Engineer (Space Communications) Job, Torrance area, California, USA, IT/Tech

Position: Site Reliability Engineer (Space Communications)

Overview

Seeking a Site Reliability Engineer (Space Communications) to help build and maintain observability infrastructure and ensure the global space communications network operates reliably as we scale ground stations around the world.

Responsibilities

Build and maintain the observability stack with tools like Grafana, Prometheus, Loki, Vector, CloudWatch, VictoriaMetrics, etc. for metrics and log ingestion across environments
Support and improve CI/CD pipelines using GitLab and ArgoCD, collaborating with development teams on deployment best practices
Help build and maintain cloud infrastructure using Terraform on AWS, contributing to the scalability and reliability of space communication systems
Work with senior engineers to establish monitoring strategies, alerting, and incident response procedures
Deploy and manage Kubernetes applications using Helm charts, focusing on reliability and developer experience
Collaborate with engineering teams to implement performance monitoring and troubleshooting across microservices
Support identity and access management integration with Okta and HashiCorp Vault
Assist in managing NixOS-based infrastructure for reproducible system configurations
Participate in incident response efforts and contribute to post-incident reviews and improvements

Basic Qualifications

2-4 years of hands-on experience with infrastructure tools and monitoring systems in production environments
Experience with containerization (Docker, Kubernetes) and basic container orchestration
Familiarity with CI/CD tools (GitLab, Jenkins, or similar) and infrastructure as code concepts
Experience with cloud platforms (AWS preferred) and basic infrastructure automation
Programming skills in Python or a similar language and experience with configuration management
Startup mentality with the ability to work in fast-paced, high-growth environments and take on diverse responsibilities
Experience with logging and metrics collection for production systems
Understanding of system reliability principles and interest in learning SRE practices

Preferred Qualifications

Some exposure to observability tools like Vector, Loki, Grafana, Prometheus, or similar monitoring systems
Experience with Terraform or other infrastructure as code tools
Familiarity with NixOS or other declarative system configuration approaches
Basic knowledge of HashiCorp Vault, Okta, or similar identity/secrets management tools
Interest in distributed systems and troubleshooting complex technical issues
Previous startup experience or demonstrated ability to learn quickly and adapt
Linux system administration experience
AWS certification or demonstrated cloud platform knowledge

Additional Information

To conform to U.S. Government space technology export regulations, including the International Traffic in Arms Regulations (ITAR), you must be a U.S. citizen, lawful permanent resident of the U.S., protected individual as defined by 8 U.S.C. 1324b(a)(3), or eligible to obtain the required authorizations from the U.S. Department of State.

Northwood is an Equal Opportunity Employer; employment with Northwood is governed on the basis of merit, competence and qualifications and will not be influenced in any manner by race, color, religion, gender, national origin/ethnicity, veteran status, disability status, age, sexual orientation, gender identity, marital status, mental or physical disability or any other legally protected status.
