Chief HPC Network Engineer - AI Infrastructure

Job in Ann Arbor, Washtenaw County, Michigan, 48103, USA
Listing for: EPAM Systems, Inc.
Full Time position
Listed on 2026-06-11
Job specializations:
  • IT/Tech
    Systems Engineer
Salary/Wage Range or Industry Benchmark: 100000 - 125000 USD Yearly
Job Description & How to Apply Below
We are looking for a Chief HPC Network Engineer to define the global technical strategy, reference architecture, and engineering vision behind advanced AI, research, and Kubernetes-based GPU infrastructure for a major global technology client.

The role focuses on establishing the long-term technical direction, governing architecture decisions across multiple programs, and setting organization-wide engineering standards for high-performance network fabrics supporting massive-scale LLM and distributed AI workloads, including InfiniBand/RDMA, high-speed Ethernet, Kubernetes networking, host-side GPU networking, SmartNIC/DPU technologies, and deep network observability. As a principal technical authority, you will shape engineering culture, mentor lead and principal engineers, influence executive client roadmaps, and own end-to-end governance of mission-critical network platforms across the portfolio.

The ideal candidate combines authoritative expertise across InfiniBand NDR/HDR and next-generation fabrics, RDMA/RoCE, NVIDIA/Mellanox networking, NCCL/MSCCL communication patterns, Linux host networking, PCIe/GPU/NIC topology, and Kubernetes networking for GPU clusters, with a proven track record of leading multiple engineering teams, defining technical strategy at the program level, and shaping industry-leading HPC/AI network platforms.

Responsibilities

Define and own the multi-year strategic vision and architectural roadmap for high-performance InfiniBand/RDMA and Ethernet fabrics powering massive-scale GPU clusters and distributed AI/LLM workloads across the client portfolio

Govern the design, evaluation, and standardization of cluster network topologies, including Fat-tree, Clos, Rail-optimized, and Dragonfly, and establish enterprise-wide decision frameworks aligned with workload scale, performance, and cost constraints

Establish and enforce organization-wide engineering standards and best practices for host-side networking, including NIC configuration, drivers, firmware, IRQ affinity, NUMA placement, PCIe topology, and GPU-to-NIC communication paths

Set the strategic direction for performance engineering across RDMA/RoCE, NCCL/MSCCL, and collective communication for multi-node GPU training workloads, and oversee resolution of the most complex systemic performance issues

Define the canonical reference architecture for Kubernetes networking on GPU clusters, including CNI plugins, network policies, multi-NIC pods, RDMA/GPU device plugins, and workload orchestration integration, and drive its adoption across programs

Own the strategy and governance for SmartNIC/DPU technologies such as NVIDIA BlueField, including SR-IOV, offload, isolation, and security use cases, and align adoption with the broader infrastructure roadmap

Define the enterprise observability strategy for network platforms, governing metrics, dashboards, alerts, congestion detection, latency tracing, SLO frameworks, and capacity/performance analysis methodologies

Provide technical leadership and mentorship to lead and principal engineers across network, Kubernetes, storage, GPU infrastructure, observability, and AI research teams, building the talent pipeline and driving cross-functional alignment at scale

Act as the principal technical authority in executive client and stakeholder forums, shaping strategic technical direction, negotiating trade-offs at the program level, and ensuring delivery of reliable, scalable network platforms across multiple engagements

Contribute to the broader engineering community through thought leadership, internal practice development, and representation of the company at industry events

Requirements

8+ years of experience in network, infrastructure, HPC, SRE, or similar engineering roles, with 4+ years focused on HPC, AI/ML, or GPU cluster networking, including 2+ years of demonstrated technical leadership at the program or portfolio level

Proven experience defining the architecture and governing delivery of InfiniBand/RDMA fabrics, high-speed Ethernet, and Linux networking in large-scale, performance-critical distributed compute environments

Authoritative expertise in host-side networking, including NICs,…