HPC and Validation Engineer
Job in Dallas, Dallas County, Texas, 75215, USA
Listed on 2026-05-31
Listing for: NorthMark Strategies LLC
Full Time position
Job specializations:
- IT/Tech: Systems Engineer
The Company
North Mark Compute & Cloud (NMC²) operates at the bleeding edge of technology, aiming to scale and enhance high-performance computing (HPC) and cloud infrastructure that supports clients’ research, production, and delivery. Engineers build critical infrastructure to eliminate friction in scientific research, simulations, analysis, and decision-making, accelerating discovery and driving faster innovation.
Responsibilities
- Architect and implement a validation framework to certify the readiness and utilization of GPU nodes across a large, distributed HPC environment.
- Define methodologies to continually assess performance and optimize infrastructure across AI/ML workloads.
- Develop and execute comprehensive performance testing using industry and customer‑specific benchmarks, ensuring optimal performance across HPC compute, storage, and networking.
- Contribute to research reports describing benchmarking discoveries, evaluating complete hardware performance and efficiency.
- Lead debugging efforts to identify and resolve bottlenecks in system performance.
- Build robust, scalable tools for automated validation and testing, utilizing Python, Go, Kubernetes, and CI/CD pipelines to streamline continuous validation and benchmarking processes.
- Implement monitoring solutions using Prometheus, Grafana, and other modern monitoring technologies to track performance metrics and real‑time health of the cluster.
- Define and implement best practices for continuous performance validation, ensuring that the infrastructure remains reliable and efficient as new technologies emerge.
- Stay informed on industry trends and advancements to ensure long‑term strategic alignment.
- Work cross‑functionally with engineering, infrastructure, and research teams to align validation efforts with broader business objectives, ensuring the platform meets evolving research demands.
Qualifications
- Accelerator performance experience, including profiling and tuning on large‑scale GPU clusters.
- In‑depth understanding of NVIDIA Cluster Kit, Nsight, Validation Suite, MLPerf, and DCGM tools for GPUs and DPUs.
- Networking and storage performance experience, including profiling and optimization with NVIDIA Cluster Kit, iPerf, or equivalent across InfiniBand/RoCE network implementations.
- System benchmarking experience on Linux and familiarity with the Phoronix Test Suite or equivalent.
- Experience with HPC workloads across distributed global locations, providing performance data to inform key architectural decisions.
- Strong proficiency in developing automation tools and micro‑benchmarking frameworks for validation using Python, Go, and Kubernetes in an Ubuntu Linux environment.
- Expertise with key monitoring platforms including OTEL, Prometheus, ELK, and Grafana, and in defining and implementing the overall observability strategy for HPC validation and performance monitoring.
- Deep understanding of emerging technologies, architectures, and strategies, with the ability to assess their potential impact on infrastructure and adopt them as part of a long‑term plan.
- Proven ability to lead complex technical projects, influence decisions, and engage with stakeholders across technical and research teams.
North Mark Compute & Cloud (NMC²) is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, disability, or veteran status.