Senior DevOps Lead Engineer, AI Acceleration (Hybrid); Santa Clara area, California, USA; IT/Tech

Position: Senior DevOps Lead Engineer (AI Acceleration), Hybrid

Overview

The Role:

You will be the senior DevOps technical lead on the Infrastructure team, owning the CI/CD pipelines, container infrastructure, observability stack, and shared tooling that AI/ML hardware accelerator development runs on: in the lab, in the cloud, and across colocations at scale.

Because we design and manufacture AI acceleration silicon, a core part of this role is working with internal cloud and physical lab systems: automating and operating on‑premises GPU clusters, high‑speed interconnects, and lab server infrastructure, not just cloud resources. You will build the automation layer that ties lab hardware, cloud environments, and developer tooling into a single, reliable system.

You will also be instrumental in scaling that system globally, as the company builds toward a follow‑the‑sun DevOps model across its expanding engineering sites.

Responsibilities

DevOps Leadership

Own CI/CD pipelines, runners, and execution environments across software, silicon, hardware, and ML teams: GitLab CI, GitHub Actions, and build systems like Bazel.

Build and maintain automated provisioning and deployment pipelines for GPU driver stacks, AI/ML frameworks (PyTorch, TensorFlow), and inference software; implement container‑based test harnesses (Docker/Kubernetes/Singularity) that verify driver and framework compatibility across hardware generations (NVIDIA, AMD, Intel).

Improve pipeline performance through parallelization, caching, and architectural changes; maintain the Docker image library supporting AI/ML workload testing across distributions and framework versions.

Automation & Infrastructure as Code

Own IaC and configuration management (Terraform, Ansible, Python, Go, Bash) across lab, on‑prem, colo, and cloud (AWS, Azure, Google Cloud Platform), covering everything from GPU/CPU driver provisioning through infrastructure deployments, with remote state management, environment isolation, and plan validation.

Build automation to eliminate toil and enforce consistency across team workflows; implement auto‑remediation where appropriate with blast‑radius controls and approval gates for production systems.

Operate and automate Kubernetes clusters and HPC container environments (Singularity/Apptainer) across cloud and on‑premises: installation, upgrades, workload management, and troubleshooting.

Observability, Reliability & Incident Response

Design and maintain dashboards, alerting, and monitoring (Prometheus, Grafana, Datadog) across CI runners, lab hardware, GPU utilization, and shared services; define SLOs/SLIs and lead structured incident response when they are breached.

Lead incident triage from bare metal to application layer, resolving infrastructure, software, and hardware faults across CI/CD, lab, container, and cloud environments, including GPU driver failures, framework crashes, and network issues.

Documentation & Global Collaboration

Create and maintain high‑quality documentation: architecture diagrams, troubleshooting guides, onboarding materials, and API/tool references.

Partner with global DevOps and SRE team members to build a consistent, scalable operating model.

Serve as a technical resource across engineering teams: developing and sharing best practices, raising technical debt and reliability risks early, and always coming with a proposed plan.

Drive innovation by supporting R&D activities and leading proof‑of‑concept (POC) and proof‑of‑value (POV) evaluations for new tooling, infrastructure patterns, and accelerator technologies.

Qualifications

Required

Bachelor’s or Master’s in Computer Science, Electrical Engineering, or a related field, with 10 years of hands‑on DevOps/infrastructure experience (8 years minimum).
Deep Linux systems expertise: package management, networking (TCP/IP stack, routing, bonding), storage, systemd, kernel parameters, and performance tuning.
Production‑grade Git‑based CI/CD experience: pipeline design, runner management, merge request workflows, caching, and artifact handling.
Strong Python and/or Bash scripting for automation, with the ability to write clean, tested, maintainable code, not just one‑off scripts.
Hands‑on Ansible experience writing playbooks from scratch for complex, multi‑host configuration…