Site Reliability Engineer Job, McLean area, Virginia, USA, IT/Tech

Overview

As a Site Reliability Engineer (SRE), you will help design, build, and operate reliable, secure, and observable cloud‑native systems that support mission‑critical applications and services. You will blend software engineering, DevOps practices, and infrastructure expertise to improve system reliability, performance, and operational excellence across the platform.

Contributions

Responsibilities

Establish development tools and infrastructure for automation.
Understand stakeholder needs and convey them to developers.
Automate and improve development, testing, deployment, and release processes.
Test and review code written by others and analyze the results.
Own and improve the reliability, availability, and performance of production systems and services.
Define, implement, and maintain Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
Perform capacity planning, scalability analysis, and performance tuning for applications and infrastructure.
Participate in on‑call rotations, incident response, and post‑incident reviews to drive long‑term improvements.
Design and implement infrastructure‑as‑code (IaC) to provision and manage cloud resources (e.g., AWS, Azure, GCP).
Build and maintain CI/CD pipelines to ensure reliable, repeatable delivery of application and infrastructure changes.
Engineer resilient architectures using concepts such as auto‑scaling, blue/green deployments, canary releases, and self‑healing patterns.
Collaborate with security and platform teams to ensure infrastructure adheres to compliance, security, and governance requirements.
Collaborate with application development teams to design reliable, observable, and operable services from the outset.
Contribute to application code, tooling, and frameworks that enhance reliability, resilience, and performance.
Act as an individual contributor and mentor more junior team members.
Present regular status updates and provide cross‑training to other DevOps team members.

Qualifications

Required

Ability to obtain a U.S. government Security Clearance.
BS degree in an IT field with 10 years of experience, or a BS degree in a non‑IT field with 12 years of related IT experience.
3 years of experience with one or more cloud platforms (i.e., AWS, Azure, or GCP).
3 years of experience with Git SCM providers such as GitHub, GitLab, or Bitbucket.
3 years of experience with at least one programming language (e.g., Python, Go, Java, or JavaScript) for tooling, automation, or application development.
Hands‑on experience working with AWS in production environments.
Hands‑on experience designing, deploying, and operating Kubernetes‑based systems (e.g., EKS, AKS, GKE).
Experience with DevOps practices and tools, including CI/CD pipelines (e.g., GitHub Actions, GitLab CI, Jenkins, Azure DevOps).
Hands‑on experience with infrastructure‑as‑code tools (e.g., Terraform, CloudFormation, Pulumi) to manage cloud resources.
Experience configuring and managing containerization and orchestration platforms.
Experience implementing monitoring, logging, and tracing solutions (e.g., CloudWatch, Prometheus, Grafana, Datadog, New Relic, Elastic, OpenTelemetry).
Familiarity with networking fundamentals (DNS, load balancing, routing, TLS) and their impact on reliability and performance.
Experience with incident management, on‑call operations, and production support practices.
Certification(s) such as:
- Cloud certifications (e.g., AWS DevOps Engineer, AWS SysOps Administrator, Azure Administrator/DevOps Engineer, GCP Professional Cloud DevOps Engineer).
- Kubernetes certifications (e.g., CKA, CKAD).

Preferred

Hands‑on experience with Drupal and Azure.
Experience implementing automated testing frameworks, including Selenium.
Excellent written and verbal communication, interpersonal, and collaboration skills.
Experience documenting the as‑is state of an environment, performing a gap analysis, and producing artifacts that articulate options and recommendations.
Experience designing and implementing SLOs, SLIs, and error budgets in production environments.
Experience with chaos engineering, game days, and resilience testing.
Local to Washington, DC metro area and…