Engineering Manager, HPC Kubernetes Platform

Job in Dallas, Dallas County, Texas, 75215, USA
Listing for: NMC2
Full Time position
Listed on 2025-12-20
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, SRE/Site Reliability, IT Support

North Mark Compute & Cloud (NMC²) operates at the bleeding edge of technology, backed by dedicated leadership and investment. Its mission is to scale and enhance the high-performance computing (HPC) and cloud infrastructure that supports its clients' research, production, and delivery, enabling breakthroughs that shape the industries of tomorrow. Its engineers build critical infrastructure that removes friction from scientific research, simulation, analysis, and decision-making, accelerating discovery and innovation.

The Position

We are seeking an experienced Engineering Manager, HPC Kubernetes Platform to lead the team responsible for designing and scaling our bare-metal Kubernetes environment—the orchestration layer powering GPU- and CPU-intensive machine-learning and HPC workloads across global datacenters.

This is a hands‑on leadership role focused on platform performance, reliability, and automation. You will define the technical roadmap, guide system architecture and optimization, and ensure our Kubernetes platform delivers top‑tier reliability and throughput for distributed ML and HPC environments. The ideal candidate is a strong technical leader who thrives at the intersection of infrastructure engineering, AI systems, and high-performance computing.

Responsibilities
  • Lead and mentor engineers designing and scaling NMC²’s bare‑metal Kubernetes platform for HPC and ML workloads.
  • Architect and optimize GPU/CPU scheduling, resource management, and performance across multi‑tenant compute clusters.
  • Drive automation and observability using Infrastructure-as-Code, CI/CD, and SRE best practices.
  • Collaborate with Research, Storage, and Network teams to integrate distributed file systems, high-speed interconnects (InfiniBand, RoCE), and custom runtimes.
  • Partner with hardware and software vendors to improve tooling, influence product roadmaps, and streamline deployment.
  • Oversee platform reliability, capacity forecasting, and performance KPIs across thousands of nodes.
Requirements
  • 7+ years in infrastructure, platform, or SRE engineering, including 2+ in technical leadership.
  • Proven experience operating Kubernetes environments tailored for HPC or ML training workloads—GPU scheduling, resource isolation, and workload optimization.
  • Deep knowledge of Linux systems, networking, and performance engineering on bare-metal hardware.
  • Experience managing large-scale, multi-tenant clusters and integrating distributed storage or high-speed networking.
  • Strong automation experience (Terraform, Ansible, or similar) and familiarity with observability tools (Prometheus, Grafana, Loki).
  • Excellent communication and stakeholder management skills; ability to translate complex technical direction into clear, actionable plans.
  • Bachelor’s Degree or equivalent experience.
Nice-to-Haves
  • Familiarity with HPC schedulers (Slurm, Flux) and container runtimes (containerd, CRI‑O).
  • Contributions to open-source Kubernetes or ML infrastructure projects.