
Senior Network & Site Reliability Engineer

Job in San Francisco, San Francisco County, California, 94199, USA
Listing for: Alembic, Inc.
Full Time position
Listed on 2026-06-15
Job specializations:
  • IT/Tech
    Systems Engineer, Network Engineer, Cloud Computing: Infrastructure & Operations, Cybersecurity
Salary/Wage Range or Industry Benchmark: 80,000 - 100,000 USD Yearly
Job Description & How to Apply Below

About Us

Alembic is the pioneering Causal AI platform. We help the world's largest enterprises move past correlation to prove what actually drives business outcomes — the question marketing and growth teams have never been able to answer with confidence. Fortune 100 companies including Nvidia, Delta Air Lines, and Mars use Alembic to make multimillion-dollar decisions on trusted, causal evidence.

We're backed by a $145M Series B from Wndr Co (founded by Jeffrey Katzenberg), Jensen Huang, Joe Montana, Prysm Capital, and Accenture. Our models run on our own NVIDIA DGX SuperPOD built on Grace Blackwell infrastructure, one of the fastest private supercomputers in the world. (We've melted GPUs getting here.)

About the Role

We're building infrastructure that has to perform under real-world scale, reliability, and security demands — and we're looking for an engineer who wants to own the foundation it runs on. This isn't a traditional "keep the lights on" role.

You will design and operate the global network and reliability layer behind one of the world's fastest private supercomputers — the fabric powering distributed compute, ML workloads, real-time analytics, and mission-critical enterprise systems. You'll work across networking, systems, automation, observability, and reliability engineering to scale a platform where performance genuinely matters, with real influence over architecture decisions.

It's a strong fit if you like solving deep infrastructure problems, building resilient systems, automating everything repetitive, and owning architecture rather than just maintaining it.

What You'll Do
  • Design and operate scalable, secure network architecture that meets high-security requirements and supports large-scale machine learning workloads.

  • Own network device configuration management end to end, ensuring consistency and reliability across the fleet.

  • Improve system and network reliability and performance through automation, observability, and proactive capacity planning.

  • Implement and manage complex network protocols and connectivity, including BGP, VPNs, WAN circuits, and external peering.

  • Build and maintain comprehensive monitoring, alerting, and incident response — SLOs, runbooks, and on-call rotations — and drive post‑incident analysis and continuous improvement.

  • Ensure security, compliance, and operational readiness across our network and cloud infrastructure.

  • Partner across engineering and data science to drive a culture of performance and reliability.

What Will Help You Succeed
  • 8+ years in network or infrastructure engineering, including 5+ years in datacenter operations and/or systems and network administration.

  • A strong background in network security, architecture, design, and operations.

  • Extensive hands‑on experience with network devices (firewalls, switches, load balancers) and large-scale architectures and protocols — BGP, QoS, MPLS, and IPsec VPNs.

  • Experience designing and operating modern datacenter network fabrics (spine‑leaf, EVPN/VXLAN, ECMP).

  • Network automation and IaC tooling (Ansible, Terraform, Nornir, or similar), plus IPAM/DCIM platforms (NetBox, Infoblox, or similar).

  • WAN engineering — carrier circuit provisioning and external network peering.

  • Familiarity with Kubernetes networking (CNI plugins, ingress, service networking, network policy) and strong operational experience with Linux-based production infrastructure.

  • Experience with monitoring and observability stacks (Prometheus, Grafana, Datadog, ELK, OpenTelemetry).

  • Solid scripting (Python, Bash) to debug complex network and system issues and automate solutions, plus excellent cross‑functional communication.

Also Helpful
  • NVIDIA networking technologies: Cumulus Linux, InfiniBand, Spectrum‑X, and BlueField DPUs (this is the fabric behind our SuperPOD).

  • Familiarity with data‑intensive platforms (Spark, Airflow, Kafka) and storage network protocols (NFS, LustreFS, iSCSI).

  • Security practices for applications and infrastructure, and experience in high‑compliance or SOC 2 environments.

The Role Is Right for You If
  • You want to own mission-critical network and infrastructure end to end — from architecture to incident management — not just keep it running.

  • You’d…

Position Requirements
10+ Years work experience