Principal Cloud Engineering and Production Operations Engineer Job, San Jose area, California, USA, IT/Tech

Principal Cloud Engineering and Production Operations Engineer

The Principal Cloud Engineering and Production Operations Engineer serves as the senior technical authority responsible for architecting, automating, and optimizing hybrid and cloud-native production environments that power critical customer-facing services and enterprise applications.

This role combines deep cloud infrastructure expertise with strong production reliability and operational engineering skills. The Principal Engineer acts as both architect and hands-on builder, ensuring scalability, resilience, and security across multi-cloud and on-prem environments.

Reporting to the Associate Director of IT and Infrastructure, this position will collaborate closely with Engineering, DevOps, Security, and IT Operations to drive a culture of automation, observability, and continuous improvement across the production ecosystem.

Key Responsibilities:

Cloud Architecture and Engineering

* Design, implement, and maintain cloud and hybrid infrastructure supporting production workloads, enterprise systems, and CI/CD pipelines

* Lead the adoption of infrastructure-as-code (IaC) using Terraform, CloudFormation, or similar tools to enable repeatable, auditable, and secure deployments

* Architect scalable and fault-tolerant solutions across OCI, AWS, Azure, and on-prem data centers, ensuring high availability and cost efficiency

* Evaluate emerging cloud services and technologies for applicability to business needs and long-term scalability goals

Production Operations and Reliability

* Serve as the technical lead for production operations, ensuring uptime, performance, and reliability of customer-facing and internal systems

* Develop and maintain observability frameworks leveraging metrics, logs, and traces to ensure proactive detection and rapid response

* Partner with engineering teams to implement SRE-inspired practices, including service level objectives (SLOs), error budgets, and post-incident reviews

* Drive root cause analysis, performance tuning, and continuous improvement of production services

Automation and CI/CD Enablement

* Collaborate with DevOps and application engineering teams to build and optimize automated deployment pipelines supporting frequent, low-risk releases

* Integrate security and compliance checks into CI/CD workflows to ensure production readiness and alignment with internal standards

* Design self-healing infrastructure and automated rollback mechanisms to reduce operational risk

* Ensure secure and reliable configuration management and environment orchestration using tools such as Ansible, Chef, or Puppet

Operational Governance and Collaboration

* Establish and enforce operational best practices for monitoring, patching, and change management across production systems

* Lead production readiness reviews for new releases and large-scale changes

* Collaborate with the Security and Compliance teams to ensure systems adhere to policy, hardening standards, and regulatory requirements

* Participate in and occasionally lead on-call rotations for critical production systems, ensuring rapid triage and resolution

Leadership and Mentorship

* Act as a technical mentor to cloud and infrastructure engineers, fostering a culture of knowledge sharing and engineering excellence

* Lead architectural reviews, design sessions, and capacity planning discussions

* Serve as a trusted advisor to management on cloud modernization, resilience engineering, and cost optimization strategies

Qualifications:

* Bachelor's degree in Computer Science, Information Systems, or a related field; Master's preferred

* 10+ years of experience in cloud and infrastructure engineering, including 3+ years in a senior or principal role

* Expertise with OCI (preferred), AWS, and/or Azure cloud services, including networking, compute, storage, and identity management

* Proven experience managing production-scale environments supporting mission-critical applications and services

* Strong proficiency in:

    * Infrastructure-as-code (Terraform, CloudFormation)

    * CI/CD and DevOps toolchains (Jenkins, GitLab, ArgoCD)

    * Container orchestration (Kubernetes, Docker)

    * Monitoring and observability platforms (Prometheus, Grafana, Datadog, ELK)

    * Scripting and automation (Python, Bash, PowerShell)

* Solid understanding of security, compliance, and networking principles in hybrid environments

* Exceptional analytical, problem-solving, and incident management skills

* Demonstrated ability to lead complex, cross-functional initiatives from concept to execution

Preferred Experience:

* Experience in high-availability SaaS or networking environments

* Knowledge of FinOps, cost optimization, and multi-cloud governance frameworks

* Familiarity with Zero Trust, identity federation, and cloud access security models

* Exposure to AI/ML infrastructure or data-driven pipelines is a plus

Why Join Us:

This is a hands-on leadership opportunity to define the next generation of cloud and production operations within a high-impact technology environment. The…