Site Reliability/Platform Engineer; Linux/Kubernetes/Python
Listed on 2026-06-02
IT/Tech
Systems Engineer, SRE/Site Reliability, Cloud Computing, Network Engineer
Site Reliability Engineer (Kubernetes / OpenShift Platform Engineering)
Location: Reston, VA
Salary: 180-190K + 10% Bonus
Must have the following: on-prem Kubernetes engineering, OpenShift, platform engineering, observability tools, incident response, automation, production troubleshooting, Linux environments
Responsibilities:
- Maintain the health, stability, and reliability of core technical platforms and platform services supporting business continuity and high availability.
- Improve end-to-end platform observability to ensure system performance, incidents, and trends are proactively identified and addressed.
- Lead incident response efforts, root-cause analysis, and postmortems to continuously improve platform reliability and reduce recurring issues.
- Partner with development teams to troubleshoot deployment, routing, ingress, and configuration issues within Kubernetes/OpenShift environments.
- Build and maintain automated deployment pipelines supporting engineering, development, and data teams.
- Develop, test, and deploy automation solutions that reduce manual intervention and improve operational efficiency.
- Lead the rollout of new platform services, features, and capabilities across hybrid infrastructure environments.
- Operate and support platform services across on-premises infrastructure and Azure cloud services.
- Maintain operational documentation, deployment procedures, incident response plans, and technical runbooks.
- Participate in on-call rotation supporting production environments and critical infrastructure systems.
- Assist with additional technical initiatives and operational responsibilities as needed.
Qualifications:
- Bachelor's degree in Computer Science or related field, or equivalent practical experience.
- 4–5+ years of experience in Kubernetes Engineering, Site Reliability Engineering, Platform Engineering, or similar infrastructure-focused roles.
- Strong hands-on Kubernetes engineering experience, including workload management, operators, routing/ingress, cluster administration, and performance management.
- Experience managing and supporting OpenShift environments is highly preferred.
- Experience deploying and supporting platform services and observability tooling.
- Strong troubleshooting skills across logs, metrics, traces, packet captures, and Kubernetes debugging tools.
- Strong understanding of observability platforms and connecting alerts, incidents, and operational trends to actionable outcomes.
- Experience working within regulated or heavily audited environments preferred.
- Strong communication skills with the ability to document technical procedures and operational activities thoroughly.
- Ability to manage multiple priorities in a dynamic, fast-paced environment.
- Strong collaboration skills with the ability to work effectively across engineering and infrastructure teams.
- Experience conducting independent technical research and presenting findings to leadership and peers.
- Proof of eligibility to work in the United States required.
Site Reliability Engineer, Kubernetes Engineer, OpenShift Engineer, Platform Engineer, DevOps Engineer, Kubernetes administration, OpenShift platform, cluster management, routing ingress, observability tools, Prometheus, Grafana, Datadog, incident response, production support, infrastructure engineer, automation engineer, CI/CD pipelines, platform reliability, troubleshooting Kubernetes, container orchestration, cloud infrastructure, Azure cloud, Linux systems, platform services, SRE jobs, enterprise infrastructure, root cause analysis, deployment automation, platform operations, production troubleshooting, hybrid infrastructure, site reliability, platform monitoring
Site Reliability Engineer, SRE, OpenShift engineer, Kubernetes engineer, Azure cloud engineer, platform engineer, DevOps engineer, observability, Grafana, Prometheus, Datadog, HashiCorp Vault, Kafka, AMQ, Redis, CI/CD, automation, Bash scripting, Python scripting, cloud infrastructure, hybrid cloud, data center, reliability engineering, incident response, root cause analysis, container platform, cluster management, Azure infrastructure, production support, platform reliability, DevOps, monitoring tools, automation engineer, enterprise infrastructure, platform services, site reliability, cloud platform, OpenShift administrator, Kubernetes troubleshooting