Secure Reliability Engineering Manager Job, Reston area, Virginia, USA, IT/Tech

Overview

We are seeking an experienced Secure Reliability Engineering (SRE) Manager to lead the reliability, resilience, and secure operation of a sovereign cloud platform supporting regulated and high-trust workloads. This role is responsible for ensuring that availability, performance, and security are engineered into the platform by design, using Terraform-driven Infrastructure as Code (IaC), cloud-native services, and open-source technologies.

The ideal candidate brings deep technical credibility in cloud reliability engineering, strong people leadership, and a security-first mindset—treating security, compliance, and sovereignty as core reliability requirements, not afterthoughts.

Key Responsibilities

Platform Reliability & Architecture

Own the reliability, availability, and resilience of sovereign cloud platforms supporting regulated workloads across hyperscalers (AWS, Azure, GCP, and sovereign variants)
Design and enforce secure information and failure boundaries, including:
- Network segmentation and fault isolation
- Identity, access, and privilege separation
- Data residency, encryption, and key management controls
Define and manage Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets aligned with sovereign and regulatory requirements
Partner with Security, Architecture, and Compliance teams to ensure reliability designs meet sovereignty, regulatory, and contractual obligations

Infrastructure as Code & Reliability Automation

Lead development and governance of Terraform-based IaC frameworks with reliability and security baked in
Establish reusable modules, standards, and pipelines for:
- Cloud-native services (compute, storage, networking, identity)
- Built-in resilience patterns (multi-zone, multi-region, failover)
- Embedded security and compliance controls
- Provisioning and configuration
- Drift detection and remediation
- Capacity management and lifecycle operations

Secure SRE Operations

Build and operate reliability-focused CI/CD pipelines for infrastructure and platform services
Lead operational practices including:
- Monitoring, logging, tracing, and alerting
- Incident response, root cause analysis, and post-incident reviews
- Change, release, and reliability risk management
Reduce toil through automation while maintaining strict security and change controls

Security, Compliance & Operational Assurance

Implement security-by-default and resilience-by-design practices across all environments
Ensure operational alignment with frameworks such as:
- Zero Trust architecture
- NIST, ISO, SOC, or equivalent regulatory standards
Support audits and assessments by delivering traceable, code-driven controls, operational evidence, and reliability metrics
Treat compliance gaps, security weaknesses, and reliability risks as production-impacting issues

Cloud-Native & Open-Source Technologies

Govern and operate cloud-native and open-source platforms
Ensure platforms are secure, observable, resilient, and supportable
Evaluate emerging technologies that improve reliability, security posture, and operational efficiency

People Leadership & Reliability Culture

Lead, mentor, and grow a team of Secure Reliability Engineers
Establish an SRE culture focused on:
- Blameless incident response
- Strong operational ownership
Define clear roadmaps, reliability goals, and success metrics aligned with business and sovereign requirements

Required Qualifications

10+ years of experience in SRE, DevOps, Cloud Engineering, or Platform Engineering
4+ years of experience leading or managing technical teams
Deep hands-on experience with Terraform in regulated production environments
Strong experience with at least one major cloud provider (AWS, Azure, GCP)
Proven experience designing highly available, secure, and isolated cloud platforms
Strong understanding of:
- Cloud security fundamentals (IAM, encryption, network security, secrets management)
- Reliability engineering concepts (SLOs, error budgets, incident management)
Experience with CI/CD, observability, and automation tooling

Preferred Qualifications

Experience supporting sovereign, government, or highly regulated environments
Kubernetes platform reliability experience in security-sensitive contexts
Fam…