Chief HPC Network Engineer - AI Infrastructure
Job in Ann Arbor, Washtenaw County, Michigan, 48103, USA
Listed on 2026-06-11
Listing for: EPAM Systems, Inc.
Full Time position
Job specializations:
- IT/Tech
- Systems Engineer
Job Description & How to Apply Below
The role focuses on establishing the long-term technical direction, governing architecture decisions across multiple programs, and setting organization-wide engineering standards for high-performance network fabrics supporting massive-scale LLM and distributed AI workloads, including InfiniBand/RDMA, high-speed Ethernet, Kubernetes networking, host-side GPU networking, SmartNIC/DPU technologies, and deep network observability. As a principal technical authority, you will shape engineering culture, mentor lead and principal engineers, influence executive client roadmaps, and own end-to-end governance of mission-critical network platforms across the portfolio.
The ideal candidate combines authoritative expertise across InfiniBand NDR/HDR and next-generation fabrics, RDMA/RoCE, NVIDIA/Mellanox networking, NCCL/MSCCL communication patterns, Linux host networking, PCIe/GPU/NIC topology, and Kubernetes networking for GPU clusters, with a proven track record of leading multiple engineering teams, defining technical strategy at the program level, and shaping industry-leading HPC/AI network platforms.
Responsibilities
Define and own the multi-year strategic vision and architectural roadmap for high-performance InfiniBand/RDMA and Ethernet fabrics powering massive-scale GPU clusters and distributed AI/LLM workloads across the client portfolio
Govern the design, evaluation, and standardization of cluster network topologies, including Fat-tree, Clos, Rail-optimized, and Dragonfly, and establish enterprise-wide decision frameworks aligned with workload scale, performance, and cost constraints
Establish and enforce organization-wide engineering standards and best practices for host-side networking, including NIC configuration, drivers, firmware, IRQ affinity, NUMA placement, PCIe topology, and GPU-to-NIC communication paths
Set the strategic direction for performance engineering across RDMA/RoCE, NCCL/MSCCL, and collective communication for multi-node GPU training workloads, and oversee resolution of the most complex systemic performance issues
Define the canonical reference architecture for Kubernetes networking on GPU clusters, including CNI plugins, network policies, multi-NIC pods, RDMA/GPU device plugins, and workload orchestration integration, and drive its adoption across programs
Own the strategy and governance for SmartNIC/DPU technologies such as NVIDIA BlueField, including SR-IOV, offload, isolation, and security use cases, and align adoption with the broader infrastructure roadmap
Define the enterprise observability strategy for network platforms, governing metrics, dashboards, alerts, congestion detection, latency tracing, SLO frameworks, and capacity/performance analysis methodologies
Provide technical leadership and mentorship to lead and principal engineers across network, Kubernetes, storage, GPU infrastructure, observability, and AI research teams, building the talent pipeline and driving cross-functional alignment at scale
Act as the principal technical authority in executive client and stakeholder forums, shaping strategic technical direction, negotiating trade-offs at the program level, and ensuring delivery of reliable, scalable network platforms across multiple engagements
Contribute to the broader engineering community through thought leadership, internal practice development, and representation of the company at industry events
Requirements
8+ years of experience in network, infrastructure, HPC, SRE, or similar engineering roles, with 4+ years focused on HPC, AI/ML, or GPU cluster networking, including 2+ years of demonstrated technical leadership at the program or portfolio level
Proven experience defining the architecture and governing delivery of InfiniBand/RDMA fabrics, high-speed Ethernet, and Linux networking in large-scale, performance-critical distributed compute environments
Authoritative expertise in host-side networking, including NICs,…