
Production Engineer (Operational Excellence)

Job in Sunnyvale, Santa Clara County, California, 94087, USA
Listing for: Crusoe
Full Time position
Listed on 2026-05-20
Job specializations:
  • IT/Tech
    SRE/Site Reliability, Cloud Computing, Systems Engineer
Salary/Wage Range or Industry Benchmark: USD 209,000 - 253,000 per year
Job Description & How to Apply Below
Position: Staff Production Engineer (Operational Excellence)

Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the only vertically integrated AI infrastructure company built from the ground up, we own and operate each layer of the stack — from electrons to tokens — to power the world's most ambitious AI workloads. When you join Crusoe, you join a team that is building the future, faster.

We're in the midst of the greatest industrial revolution of our time. The demand for AI compute is boundless, and power is a bottleneck. We're solving that — with an energy-first approach that makes AI infrastructure better for the world and faster for the people innovating with AI.

We're looking for problem‑solving, opportunity‑finding teammates with a sense of urgency, who believe in the scale of our ambition and thrive on a path not fully paved — people who want to grow their careers alongside a team of experts across energy, manufacturing, data center construction, and cloud services.

If you want to do the most meaningful work of your career, help our customers and partners advance their AI strategies, and be part of a high‑performing team that believes in each other, come build with us at Crusoe.

About This Role:

Crusoe is building the most reliable, energy‑efficient, AI‑optimized cloud platform — and Production Engineering sits at the heart of that mission. As a Staff Production Engineer focused on Operational Excellence, you will help ensure the reliability, scalability, and performance of Crusoe's GPU cloud that powers next‑generation AI workloads.

This role is ideal for senior engineers who enjoy solving complex production problems, leading reliability strategy across large‑scale distributed systems, and building automation that keeps infrastructure running smoothly. You'll play a key role in strengthening the operational foundation of Crusoe's cloud while helping scale infrastructure that supports demanding AI and HPC workloads.

You'll partner closely with Production Engineers, infrastructure teams, and platform engineers to improve system reliability, reduce operational toil, and drive continuous improvements across Crusoe's rapidly growing GPU cloud.

What You'll Be Working On:

  • Lead cross‑functional efforts to define and evolve availability metrics for Crusoe's cloud platform, including establishing, measuring, and improving SLIs and SLOs

  • Drive production incident response, diagnosing and resolving service disruptions while leading post‑incident reviews and root cause analysis

  • Architect, operate, and improve observability across Crusoe's infrastructure using tools such as Prometheus, Grafana, Alertmanager, and OpenTelemetry

  • Identify reliability risks, performance bottlenecks, and early indicators of potential production issues across distributed systems

  • Design and develop automation and tooling that reduces operational toil, improves recovery times, and enables self‑healing infrastructure

  • Partner with compute, networking, storage, and platform teams to strengthen service resilience and disaster recovery capabilities

  • Define and champion operational processes, knowledge sharing, and reliability best practices across the engineering organization

  • Mentor and grow junior and mid‑level engineers, helping build technical depth across the team

What You'll Bring to the Team:

  • Bachelor's degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)

  • 8+ years of experience in Production Engineering, SRE, or large‑scale infrastructure operations

  • Demonstrated experience supporting GPU workloads, HPC environments, or latency/throughput‑sensitive distributed systems

  • Previous experience in infrastructure roles building or managing compute, storage, or networking platforms

  • Deep knowledge of Linux/Unix systems, including debugging complex issues across kernel and user space

  • Strong understanding of modern cloud infrastructure fundamentals, including Kubernetes, distributed systems, virtualization, and cloud platforms (AWS/GCP)

  • Proven track record with incident management practices and reliability frameworks (SRE, ITIL, or similar)

  • Hands‑on experience with monitoring and observability tools such as Prometheus and Grafana

  • Experi…
