
Senior Production Engineer, Operational Excellence

Job in Sunnyvale, Santa Clara County, California, 94087, USA
Listing for: Crusoe
Full Time position
Listed on 2026-06-26
Job specializations:
  • IT/Tech
    SRE/Site Reliability
Salary/Wage Range or Industry Benchmark: 172,000 - 209,000 USD yearly
Job Description & How to Apply Below

About This Role

Crusoe is building the most reliable, energy‑efficient, AI‑optimised cloud platform. Production Engineering sits at the heart of that mission. As a Production Engineer focused on Operational Excellence, you will help ensure the reliability, scalability, and performance of Crusoe’s GPU cloud that powers next‑generation AI workloads.

This role is ideal for engineers who enjoy solving complex production problems, improving large‑scale distributed systems, and building automation that keeps infrastructure running smoothly. You’ll play a key role in strengthening the operational foundation of Crusoe’s cloud while helping scale infrastructure that supports demanding AI and HPC workloads.

You’ll partner closely with Production Engineers, infrastructure teams, and platform engineers to improve system reliability, reduce operational toil, and drive continuous improvements across Crusoe’s rapidly growing GPU cloud.

What You’ll Be Working On
  • Collaborate with cross‑functional teams to define and evolve availability metrics for Crusoe’s cloud platform, including establishing, measuring, and improving SLIs and SLOs
  • Participate in production incident response, diagnosing and resolving service disruptions while contributing to post‑incident reviews and root cause analysis
  • Build, operate, and improve observability across Crusoe’s infrastructure using tools such as Prometheus, Grafana, Alertmanager, and OpenTelemetry
  • Identify reliability risks, performance bottlenecks, and early indicators of potential production issues across distributed systems
  • Develop automation and tooling that reduces operational toil, improves recovery times, and enables self‑healing infrastructure
  • Partner with compute, networking, storage, and platform teams to strengthen service resilience and disaster recovery capabilities
  • Contribute to improving operational processes, knowledge sharing, and reliability best practices across the engineering organisation
  • Continue growing technical depth through mentorship, training, and hands‑on work operating large‑scale AI infrastructure
What You’ll Bring to the Team
  • 5+ years of experience in Production Engineering, SRE, or large‑scale infrastructure operations
  • Experience supporting GPU workloads, HPC environments, or latency/throughput‑sensitive distributed systems
  • Strong knowledge of Linux/Unix systems, including debugging complex issues across kernel and user space
  • Previous experience in infrastructure roles building or managing compute, storage, or networking platforms
  • Understanding of modern cloud infrastructure fundamentals including Kubernetes, distributed systems, virtualization, and cloud platforms (AWS/GCP)
  • Familiarity with incident management practices and reliability frameworks (SRE, ITIL, or similar)
  • Experience with monitoring and observability tools such as Prometheus and Grafana, or a strong desire to deepen expertise in this area
  • Familiarity with infrastructure‑as‑code and configuration management tools such as Terraform or Ansible
  • Scripting or programming experience with languages such as Go, Python, C, or C++
  • Strong communication skills and the ability to collaborate across engineering teams
  • Ability to remain calm and effective while troubleshooting complex issues in high‑impact production environments
  • A growth mindset and strong interest in reliability engineering, automation, and operational excellence
Bonus Points
  • Experience working with Kubernetes or container orchestration platforms at scale
  • Exposure to change management processes, operational readiness reviews, or structured root cause analysis
  • Experience designing self‑healing systems, automated remediation, or event‑driven operational tooling
  • Interest in scaling AI or HPC infrastructure and solving reliability challenges in GPU‑heavy environments
  • Passion for mentorship, learning, and developing deeper expertise in Production Engineering
Benefits
  • Industry competitive pay
  • Restricted Stock Units in a fast‑growing, well‑funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid parental leave
  • Paid life insurance, short‑term and…
Position Requirements
10+ Years work experience