Senior AI Infrastructure Engineer
Listed on 2026-05-27
Senior Staff Engineer, Software
Date: May 25, 2026
General Overview
Functional Area: Engineering
Career Stream: Design - Software Engineering
Job Code: SSE-ENG-DSE
Job Level: Level 11
IC/MGR: Individual Contributor
Direct/Indirect Indicator: Indirect
This is a high-impact, hands‑on technical leadership role where you will architect and build systems that enable deployment, monitoring, and optimization of large‑scale infrastructure supporting AI workloads across modern data center environments.
You will operate at the intersection of:
- Infrastructure management, monitoring, and diagnostics
This role requires deep technical expertise along with the ability to drive end‑to‑end solutions from architecture through deployment and troubleshooting.
Detailed Description
- Lead the architecture, design, and development of scalable AI infrastructure platforms supporting GPU‑based data center environments
- Build and enhance orchestration systems responsible for infrastructure deployment, provisioning, monitoring, and lifecycle management
- Design distributed systems with a focus on scalability, resiliency, fault tolerance, concurrency, and performance optimization
- Develop infrastructure observability and diagnostics capabilities across GPU, networking, and storage environments
- Define telemetry, health monitoring, and performance validation strategies for large‑scale AI infrastructure deployments
- Develop and support data center networking and orchestration workflows including ZTP, DHCP, provisioning, and automated infrastructure configuration
- Work across modern AI fabric and data center networking architectures including Clos fabrics, EVPN, and L2/L3 networking environments
- Write high‑performance backend software and infrastructure services using Python or Go within Kubernetes‑based environments
- Troubleshoot and resolve complex infrastructure, networking, orchestration, and performance issues in live production data center environments
- Lead root cause analysis efforts and drive issues through resolution across software, networking, and infrastructure layers
- Partner cross‑functionally with engineering, hardware, platform, lab, and customer teams to support deployments and operational success
- Drive technical direction, architecture decisions, engineering best practices, and mentorship across the organization
- Translate real‑world deployment challenges into scalable engineering solutions that improve reliability, automation, and operational efficiency
- Operate as a hands‑on technical leader capable of driving initiatives from architecture and development through deployment and production support
- 12+ years of experience in software engineering focused on infrastructure, distributed systems, networking, or large‑scale platform development
- Strong expertise in data center networking fundamentals, including:
  - L2/L3 networking
  - BGP and EVPN
  - Clos fabrics and AI networking architectures
- Proven experience designing and building scalable distributed systems in production environments
- Hands‑on experience with infrastructure orchestration, provisioning, and large‑scale data center deployments
- Strong programming experience in Python or Go
- Experience building systems within Kubernetes‑based environments
- Strong understanding of system scalability, concurrency, resiliency, and performance optimization
- Demonstrated ability to troubleshoot and debug complex multi‑layer production systems
- Strong communication and collaboration skills with the ability to work across technical and non‑technical teams
- Experience with AI/ML infrastructure, GPU clusters, or high‑performance computing (HPC) environments
- Experience with AI infrastructure monitoring, observability, and diagnostics platforms
- Familiarity with AI workload orchestration and scheduling systems
- Experience with infrastructure automation tools such as Ansible
- Experience supporting customer deployments and external stakeholder engagements
- Background supporting large‑scale data center or cloud infrastructure platforms
- 12+ years of experience
Bachelor's degree, or an equivalent combination of education and experience.
Educational requirements may vary by geography.