Senior Network & Site Reliability Engineer Job, San Francisco area, California, USA, IT/Tech

About Us

Alembic is the pioneering Causal AI platform. We help the world's largest enterprises move past correlation to prove what actually drives business outcomes — the question marketing and growth teams have never been able to answer with confidence. Fortune 100 companies including Nvidia, Delta Air Lines, and Mars use Alembic to make multimillion-dollar decisions on trusted, causal evidence.

We're backed by a $145M Series B from Wndr Co (founded by Jeffrey Katzenberg), Jensen Huang, Joe Montana, Prysm Capital, and Accenture. Our models run on our own NVIDIA DGX SuperPOD built on Grace Blackwell infrastructure, one of the fastest private supercomputers in the world. (We've melted GPUs getting here.)

About the Role

We're building infrastructure that has to perform under real-world scale, reliability, and security demands — and we're looking for an engineer who wants to own the foundation it runs on. This isn't a traditional "keep the lights on" role.

You will design and operate the global network and reliability layer behind one of the world's fastest private supercomputers — the fabric powering distributed compute, ML workloads, real-time analytics, and mission-critical enterprise systems. You'll work across networking, systems, automation, observability, and reliability engineering to scale a platform where performance genuinely matters, with real influence over architecture decisions.

It's a strong fit if you like solving deep infrastructure problems, building resilient systems, automating everything repetitive, and owning architecture rather than just maintaining it.

What You'll Do

Design and operate scalable, secure network architecture that meets high-security requirements and supports large‑scale machine learning workloads.
Own network device configuration management end to end, ensuring consistency and reliability across the fleet.
Improve system and network reliability and performance through automation, observability, and proactive capacity planning.
Implement and manage complex network protocols and connectivity, including BGP, VPNs, WAN circuits, and external peering.
Build and maintain comprehensive monitoring, alerting, and incident response — SLOs, runbooks, and on-call rotations — and drive post‑incident analysis and continuous improvement.
Ensure security, compliance, and operational readiness across our network and cloud infrastructure.
Partner across engineering and data science to drive a culture of performance and reliability.

What Will Help You Succeed

8+ years in network or infrastructure engineering, including 5+ years in datacenter operations and/or systems and network administration.
A strong background in network security, architecture, design, and operations.
Extensive hands‑on experience with network devices (firewalls, switches, load balancers) and large-scale architectures and protocols — BGP, QoS, MPLS, and IPsec VPNs.
Experience designing and operating modern datacenter network fabrics (spine‑leaf, EVPN/VXLAN, ECMP).
Network automation and IaC tooling (Ansible, Terraform, Nornir, or similar), plus IPAM/DCIM platforms (NetBox, Infoblox, or similar).
WAN engineering — carrier circuit provisioning and external network peering.
Familiarity with Kubernetes networking (CNI plugins, ingress, service networking, network policy) and strong operational experience with Linux-based production infrastructure.
Experience with monitoring and observability stacks (Prometheus, Grafana, Datadog, ELK, OpenTelemetry).
Solid scripting (Python, Bash) to debug complex network and system issues and automate solutions, plus excellent cross‑functional communication.

Also Helpful

NVIDIA networking technologies: Cumulus Linux, InfiniBand, Spectrum‑X, and BlueField DPUs (this is the fabric behind our SuperPOD).
Familiarity with data‑intensive platforms (Spark, Airflow, Kafka) and storage network protocols (NFS, LustreFS, iSCSI).
Security practices for applications and infrastructure, and experience in high‑compliance or SOC 2 environments.

The Role Is Right for You If

You want to own mission-critical network and infrastructure end to end — from architecture to incident management — not just keep it running.
You’d…