Principal Cloud and Production Operations Engineer
Listed on 2026-06-21
IT/Tech
Cloud Computing: Infrastructure & Operations, Systems Engineer, SRE/Site Reliability
The Principal Cloud and Production Operations Engineer serves as the senior technical authority responsible for architecting, automating, and optimizing hybrid and cloud-native production environments that power critical customer-facing services and enterprise applications.
This role combines deep cloud infrastructure expertise with strong production reliability and operational engineering skills. The Principal Engineer acts as both architect and hands-on builder, ensuring scalability, resilience, and security across multi-cloud and on-prem environments.
Reporting to the Associate Director of IT and Infrastructure, this position will collaborate closely with Engineering, DevOps, Security, and IT Operations to drive a culture of automation, observability, and continuous improvement across the production ecosystem.
Key Responsibilities
Cloud Architecture and Engineering
- Design, implement, and maintain cloud and hybrid infrastructure supporting production workloads, enterprise systems, and CI/CD pipelines.
- Lead the adoption of infrastructure-as-code (IaC) using Terraform, CloudFormation, or similar tools to enable repeatable, auditable, and secure deployments.
- Architect scalable and fault-tolerant solutions across OCI, AWS, Azure, and on-prem data centers, ensuring high availability and cost efficiency.
- Evaluate emerging cloud services and technologies for applicability to business needs and long-term scalability goals.
- Serve as the technical lead for production operations, ensuring uptime, performance, and reliability of customer-facing and internal systems.
- Develop and maintain observability frameworks leveraging metrics, logs, and traces to ensure proactive detection and rapid response.
- Partner with engineering teams to implement SRE-inspired practices, including service level objectives (SLOs), error budgets, and post-incident reviews.
- Drive root cause analysis, performance tuning, and continuous improvement of production services.
- Collaborate with DevOps and application engineering teams to build and optimize automated deployment pipelines supporting frequent, low-risk releases.
- Integrate security and compliance checks into CI/CD workflows to ensure production readiness and alignment with internal standards.
- Design self-healing infrastructure and automated rollback mechanisms to reduce operational risk.
- Ensure secure and reliable configuration management and environment orchestration using tools such as Ansible, Chef, or Puppet.
- Establish and enforce operational best practices for monitoring, patching, and change management across production systems.
- Lead production readiness reviews for new releases and large-scale changes.
- Collaborate with the Security and Compliance teams to ensure systems adhere to policy, hardening standards, and regulatory requirements.
- Participate in and occasionally lead on-call rotations for critical production systems, ensuring rapid triage and resolution.
- Act as a technical mentor to cloud and infrastructure engineers, fostering a culture of knowledge sharing and engineering excellence.
- Lead architectural reviews, design sessions, and capacity planning discussions.
- Serve as a trusted advisor to management on cloud modernization, resilience engineering, and cost optimization strategies.
- Bachelor’s degree in Computer Science, Information Systems, or related field; Master’s preferred.
- 10+ years of experience in cloud and infrastructure engineering, including 3+ years in a senior or principal role.
- Expertise with OCI (preferred), AWS, and/or Azure cloud services, including networking, compute, storage, and identity management.
- Proven experience managing production-scale environments supporting mission-critical applications and services.
- Strong proficiency in:
  - Infrastructure-as-code (Terraform, CloudFormation)
  - CI/CD and DevOps toolchains (Jenkins, GitLab, ArgoCD)
  - Container orchestration (Kubernetes, Docker)
  - Monitoring and observability platforms (Prometheus, Grafana, Datadog, ELK)
  - Scripting and automation (Python, Bash, PowerShell)
- Solid understanding of security, compliance, and networking principles in hybrid environments.
- Exceptional analytical, problem-solving, and incident management skills.
- Demonstrated ability to lead complex, cross-functional initiatives from concept to execution.
- Experience in high-availability SaaS or networking environments.
- Knowledge of FinOps, cost optimization, and multi-cloud governance frameworks.
- Familiarity with Zero Trust, identity federation, and cloud access security models.
- Exposure to AI/ML infrastructure or data-driven pipelines is a plus.