Principal Cloud and Production Operations Engineer

Job in San Jose, Santa Clara County, California, 95199, USA
Listing for: Qode
Full Time position
Listed on 2026-06-20
Job specializations:
  • IT/Tech
    Cloud Computing: Infrastructure & Operations, Systems Engineer, SRE/Site Reliability
Salary/Wage Range or Industry Benchmark: 140,000 - 180,000 USD yearly

The Principal Cloud and Production Operations Engineer serves as the senior technical authority responsible for architecting, automating, and optimizing hybrid and cloud-native production environments that power critical customer-facing services and enterprise applications.

This role combines deep cloud infrastructure expertise with strong production reliability and operational engineering skills. The Principal Engineer acts as both architect and hands-on builder, ensuring scalability, resilience, and security across multi-cloud and on-prem environments.

Reporting to the Associate Director of IT and Infrastructure, this position will collaborate closely with Engineering, DevOps, Security, and IT Operations to drive a culture of automation, observability, and continuous improvement across the production ecosystem.

Key Responsibilities
Cloud Architecture and Engineering
  • Design, implement, and maintain cloud and hybrid infrastructure supporting production workloads, enterprise systems, and CI/CD pipelines.
  • Lead the adoption of infrastructure-as-code (IaC) using Terraform, CloudFormation, or similar tools to enable repeatable, auditable, and secure deployments.
  • Architect scalable and fault-tolerant solutions across OCI, AWS, Azure, and on-prem data centers, ensuring high availability and cost efficiency.
  • Evaluate emerging cloud services and technologies for applicability to business needs and long-term scalability goals.
Production Operations and Reliability
  • Serve as the technical lead for production operations, ensuring uptime, performance, and reliability of customer-facing and internal systems.
  • Develop and maintain observability frameworks leveraging metrics, logs, and traces to ensure proactive detection and rapid response.
  • Partner with engineering teams to implement SRE-inspired practices, including service level objectives (SLOs), error budgets, and post-incident reviews.
  • Drive root cause analysis, performance tuning, and continuous improvement of production services.
Automation and CI/CD Enablement
  • Collaborate with DevOps and application engineering teams to build and optimize automated deployment pipelines supporting frequent, low-risk releases.
  • Integrate security and compliance checks into CI/CD workflows to ensure production readiness and alignment with internal standards.
  • Design self-healing infrastructure and automated rollback mechanisms to reduce operational risk.
  • Ensure secure and reliable configuration management and environment orchestration using tools such as Ansible, Chef, or Puppet.
Operational Governance and Collaboration
  • Establish and enforce operational best practices for monitoring, patching, and change management across production systems.
  • Lead production readiness reviews for new releases and large-scale changes.
  • Collaborate with the Security and Compliance teams to ensure systems adhere to policy, hardening standards, and regulatory requirements.
  • Participate in and occasionally lead on-call rotations for critical production systems, ensuring rapid triage and resolution.
Leadership and Mentorship
  • Act as a technical mentor to cloud and infrastructure engineers, fostering a culture of knowledge sharing and engineering excellence.
  • Lead architectural reviews, design sessions, and capacity planning discussions.
  • Serve as a trusted advisor to management on cloud modernization, resilience engineering, and cost optimization strategies.
Qualifications
  • Bachelor’s degree in Computer Science, Information Systems, or a related field; Master’s preferred.
  • 10+ years of experience in cloud and infrastructure engineering, including 3+ years in a senior or principal role.
  • Expertise with OCI (preferred), AWS, and/or Azure cloud services, including networking, compute, storage, and identity management.
  • Proven experience managing production-scale environments supporting mission-critical applications and services.
  • Strong proficiency in:
    • Infrastructure-as-code (Terraform, CloudFormation)
    • CI/CD and DevOps toolchains (Jenkins, GitLab, ArgoCD)
    • Container orchestration (Kubernetes, Docker)
    • Monitoring and observability platforms (Prometheus, Grafana, Datadog, ELK)
    • Scripting and automation (Python, Bash, PowerShell)
  • Solid understanding of security, compliance, and networking principles in hybrid environments.
  • Exceptional analytical, problem-solving, and incident management skills.
  • Demonstrated ability to lead complex, cross-functional initiatives from concept to execution.
Preferred Experience
  • Experience in high-availability SaaS or networking environments.
  • Knowledge of FinOps, cost optimization, and multi-cloud governance frameworks.
  • Familiarity with Zero Trust, identity federation, and cloud access security models.
  • Exposure to AI/ML infrastructure or data-driven pipelines is a plus.