Senior Network Engineer, Operations Job, San Francisco area, California, USA, IT/Tech

Position: Senior Staff Network Engineer, Operations

Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the only vertically integrated AI infrastructure company built from the ground up, we own and operate each layer of the stack — from electrons to tokens — to power the world's most ambitious AI workloads. When you join Crusoe, you join a team that is building the future, faster.

We're in the midst of the greatest industrial revolution of our time. The demand for AI compute is boundless, and power is a bottleneck. We're solving that — with an energy-first approach that makes AI infrastructure better for the world and faster for the people innovating with AI.

We're looking for problem-solving, opportunity-finding teammates with a sense of urgency, who believe in the scale of our ambition and thrive on a path not fully paved — people who want to grow their careers alongside a team of experts across energy, manufacturing, data center construction, and cloud services.

If you want to do the most meaningful work of your career, help our customers and partners advance their AI strategies, and be part of a high-performing team that believes in each other, come build with us at Crusoe.

About this Role

Crusoe Cloud is seeking a Senior Staff Network Operations Engineer to own production reliability across our global network, including edge, backbone, data center fabric, and GPU cluster interconnects. You will drive incident response, root cause analysis, and the operational excellence initiatives that keep our hyperscale AI infrastructure healthy at scale.

This is a senior production ownership role, not architecture, not pre-sales, not purely automation. You will set operational standards, define SLIs and SLOs, mentor Staff and Senior engineers, and serve as the senior escalation point during high-severity events. This is the role that keeps the network up.

What You'll Be Working On

Own Production Reliability: Serve as the senior technical owner for uptime of Crusoe's global edge, backbone, data center, and GPU cluster network, directly affecting the availability of AI workloads running on hundreds of thousands of GPUs.
Lead Incident Response: Own end-to-end response for high-severity network events, including rapid mitigation, stakeholder communication, and postmortem documentation that prevents recurrence.
Drive Root Cause Analysis: Lead RCAs for production incidents, identify systemic issues, author remediation plans, and track them to closure.
Define SLIs and SLOs: Partner with Architecture and Site Reliability to define network reliability metrics and service level objectives, backed by real-time dashboards and alerting.
Set Operational Standards: Author and maintain runbooks, escalation playbooks, and SOPs used by the broader operations team.
Improve Observability: Drive continuous improvement of Crusoe's network monitoring stack, including streaming telemetry, SNMP, NetFlow, and tools such as Kentik, Grafana, Prometheus, and ThousandEyes.
Build Operational Automation: Write Python-based auto-remediation tooling that reduces toil and accelerates mean time to resolution for known failure modes.
Mentor and Multiply: Provide technical guidance to Staff and Senior engineers. Drive post-incident learning and build a culture of operational excellence across the team.

What You'll Bring to the Team

12+ years of production network engineering experience with a demonstrated focus on large-scale operations, incident response, and reliability in hyperscale or internet-scale environments.
Observability and Monitoring: Hands-on experience with streaming telemetry, SNMP, NetFlow, sFlow, and tools such as Kentik, Grafana, Prometheus, ThousandEyes, and Arbor.
GPU Cluster and RDMA Networking: Hands-on experience operating RDMA/RoCE (v1 and v2) lossless fabrics for GPU and HPC workloads, including PFC, ECN, and DCQCN tuning. Required at this level.
Demonstrated Technical Leadership: Proven track record owning production reliability at scale, leading RCAs that drove systemic change, and setting operational standards the broader org executes against.
Hyperscale Operational Depth: Comfort operating 10K+ device fleets across multi-region environments…
