Compute Platform Engineer Job Dallas area, Texas USA, IT/Tech

## Compute Platform Engineer

Locations: Dallas, TX
Time type: Full time
Posted on: Posted today
Job requisition: r13099

The Company

Northmark Compute & Cloud (NMC[2]) is backed by dedicated leadership and investment, with a clear mission as it operates at the bleeding edge of technology. Its goal is to scale and enhance the high-performance computing (HPC) and cloud infrastructure that supports its clients' research, production, and delivery, enabling breakthroughs that shape the industries of tomorrow. Its engineers build critical infrastructure to eliminate friction in scientific research, simulations, analysis, and decision-making, accelerating discovery and driving faster innovation.

The Position

The Compute Platform Engineer role is responsible for the day-to-day reliability, performance, and operational health of our high-performance compute platforms that support critical research and production workloads. This position focuses on maintaining and troubleshooting CPU and GPU infrastructure, coordinating with vendors, and ensuring systems operate consistently. Working closely with platform, infrastructure, and operations teams, the role plays a key part in sustaining a stable compute environment.

We are seeking a highly skilled and motivated engineer to join our compute platform management team. In this role, you will take ownership of the reliability and operational excellence of our high-performance computing infrastructure, which underpins our firm’s research and production workloads. As a Compute Platform Engineer, you will be responsible for identifying and resolving hardware issues, coordinating with vendors, and ensuring compute nodes (CPU and GPU) maintain peak performance.

This contract role is ideal for someone who thrives in technically demanding environments and is eager to contribute to the continuous evolution of our compute platform.

Responsibilities:
* Design, configure, and manage a high-performance compute infrastructure made up of GPU and CPU nodes.
* Manage the full firmware/BIOS lifecycle across our HPC/AI fleet – from baselines and validation through rollout and compliance.
* Troubleshoot hardware components (CPU, GPU, DPU, NVSwitch, NICs, memory, PSU, BMC) and guide replacement or configuration changes. Diagnose recurring hardware issues and automate their remediation to improve reliability and reduce recovery time.
* Work on the latest AI platforms from day one (e.g., NVL72 / Grace Blackwell), ensuring they are stable, performant, and ready for production use.
* Monitor hardware performance, identify areas for improvement, and implement solutions.
* Automate health checks and onboarding workflows to accelerate safe deployment.
* Collaborate with vendors on firmware issues – providing clear repro cases, logs, and impact to drive fixes and improvements.
* Recommend process, tooling, and architectural improvements to strengthen platform operations.
* Perform diagnostics, tuning, and capacity planning to ensure smooth scale-out.
* Analyze existing hardware lifecycle processes and provide recommendations for improvement and optimization.
* Collaborate with various teams to integrate hardware improvements and align with organizational goals.
* Implement best practices for security hardening of the platform and associated systems.
* Mentor junior engineers and foster a culture of continuous learning and improvement.
* Act as a subject matter expert, providing guidance and support for infrastructure-related issues.
* Leverage infrastructure as code (IaC) methodologies to ensure efficient and scalable infrastructure management.

Requirements:
* 3+ years of hands-on experience supporting large-scale compute platforms
* Proficiency with HPE server infrastructure, such as ProLiant and Apollo, and NVIDIA GPUs, including A100 and H200
* Solid understanding of server architecture, including UEFI/BIOS, PCIe devices, and out-of-band management systems (such as iLO and BMC)
* Proven ability to resolve complex hardware issues and manage vendor relationships
* Familiarity with automation tools such as Ansible, Terraform, and CI/CD systems
* Working knowledge…