Secure Reliability Engineering Manager
Listed on 2026-02-08
IT/Tech
Cybersecurity, Cloud Computing, Systems Engineer, SRE/Site Reliability
Overview
We are seeking an experienced Secure Reliability Engineering (SRE) Manager to lead the reliability, resilience, and secure operation of a sovereign cloud platform supporting regulated and high-trust workloads. This role is responsible for ensuring that availability, performance, and security are engineered into the platform by design, using Terraform-driven Infrastructure as Code (IaC), cloud-native services, and open-source technologies.
The ideal candidate brings deep technical credibility in cloud reliability engineering, strong people leadership, and a security-first mindset—treating security, compliance, and sovereignty as core reliability requirements, not afterthoughts.
Key Responsibilities
Platform Reliability & Architecture
- Own the reliability, availability, and resilience of sovereign cloud platforms supporting regulated workloads across hyperscalers (AWS, Azure, GCP, and sovereign variants)
- Design and enforce secure information and failure boundaries, including:
  - Network segmentation and fault isolation
  - Identity, access, and privilege separation
  - Data residency, encryption, and key management controls
- Define and manage Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets aligned with sovereign and regulatory requirements
- Partner with Security, Architecture, and Compliance teams to ensure reliability designs meet sovereignty, regulatory, and contractual obligations
Infrastructure as Code & Reliability Automation
- Lead development and governance of Terraform-based IaC frameworks with reliability and security baked in
- Establish reusable modules, standards, and pipelines for:
  - Cloud-native services (compute, storage, networking, identity)
  - Built-in resilience patterns (multi-zone, multi-region, failover)
  - Embedded security and compliance controls
  - Provisioning and configuration
  - Drift detection and remediation
  - Capacity management and lifecycle operations
Secure SRE Operations
- Build and operate reliability-focused CI/CD pipelines for infrastructure and platform services
- Lead operational practices including:
  - Monitoring, logging, tracing, and alerting
  - Incident response, root cause analysis, and post-incident reviews
  - Change, release, and reliability risk management
- Reduce toil through automation while maintaining strict security and change controls
Security, Compliance & Operational Assurance
- Implement security-by-default and resilience-by-design practices across all environments
- Ensure operational alignment with frameworks such as:
  - Zero Trust architecture
  - NIST, ISO, SOC, or equivalent regulatory standards
- Support audits and assessments by delivering traceable, code-driven controls, operational evidence, and reliability metrics
- Treat compliance gaps, security weaknesses, and reliability risks as production-impacting issues
Cloud-Native & Open-Source Technologies
- Govern and operate cloud-native and open-source platforms
- Ensure platforms are secure, observable, resilient, and supportable
- Evaluate emerging technologies that improve reliability, security posture, and operational efficiency
People Leadership & Reliability Culture
- Lead, mentor, and grow a team of Secure Reliability Engineers
- Establish an SRE culture focused on:
  - Blameless incident response
  - Strong operational ownership
- Define clear roadmaps, reliability goals, and success metrics aligned with business and sovereign requirements
Qualifications
- 10+ years of experience in SRE, DevOps, Cloud Engineering, or Platform Engineering
- 4+ years of experience leading or managing technical teams
- Deep hands-on experience with Terraform in regulated production environments
- Strong experience with at least one major cloud provider (AWS, Azure, GCP)
- Proven experience designing highly available, secure, and isolated cloud platforms
- Strong understanding of:
  - Cloud security fundamentals (IAM, encryption, network security, secrets management)
  - Reliability engineering concepts (SLOs, error budgets, incident management)
- Experience with CI/CD, observability, and automation tooling
- Experience supporting sovereign, government, or highly regulated environments
- Kubernetes platform reliability experience in security-sensitive contexts
- Fam…