Solutions Engineer - AI/HPC Infrastructure
Listed on 2026-01-02
Remote (work from home) with travel to customers
Eastern and Central time zones preferred
Drive Nets is a leader in disaggregated high‑scale networking solutions for service providers and AI infrastructures. Founded in December 2015, Drive Nets created a radical new way to build networks by adapting the architectural model of the cloud to telco‑grade networking. This solution accelerates network deployment, improves the network’s economic model, and radically simplifies network operations. Its customers include Comcast, Orange, and KDDI, and over 80% of AT&T’s network traffic now runs through a disaggregated core powered by Drive Nets software.
Drive Nets’ Network Cloud‑AI solution, based on the same technology, was introduced to the market in 2023. It provides the highest‑performance Ethernet‑based AI networking solution and is already deployed by Hyperscalers, Neo Clouds, and Enterprises. Having raised over $587 million in three funding rounds, Drive Nets continues to deploy the most innovative network infrastructure and is looking for the most talented people to be part of this journey.
As a Solutions Engineer, you will play a pivotal role in designing, deploying, and optimizing Drive Nets’ Network Cloud AI Infrastructure solutions. This individual contributor role requires a blend of technical expertise, leadership, and hands‑on experience to implement cutting‑edge solutions for our customers. You will collaborate with sales engineering teams, customers, and cross‑functional teams – including Product Management, Solution Architects, Engineering, and Marketing – to define technical requirements, articulate solution value, and ensure successful deployment on‑site.
Key responsibilities include guiding customers through the design and deployment process, aligning technical solutions with business needs, and providing critical feedback to improve Drive Nets’ product offerings. This position demands strong technical acumen, exceptional communication skills, and the ability to lead complex, high‑impact projects in dynamic environments.
Responsibilities:
- Build robust AI/HPC infrastructure for new and existing customers.
- Take a hands‑on technical role in building and supporting NVIDIA/AMD‑based platforms.
- Support operational and reliability aspects of large‑scale AI clusters, focusing on performance at scale, training stability, real‑time monitoring, logging, and alerting.
- Administer Linux systems, ranging from powerful GPU‑enabled servers to general‑purpose compute systems.
- Design and plan rack layouts and network topologies to support customer requirements.
- Design and evaluate automation scripts for network operations, configuring server and switch fabrics.
- Perform NCCL, RCCL, LLM, and RDMA performance benchmarks as part of the design and evaluation process of the deployment.
- Benchmark the latest GPU compute and NIC solutions from all major compute vendors over the Drive Nets networking fabric.
- Install and configure Drive Nets products, ensuring optimal performance and customer satisfaction.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
- Engage in and improve the whole lifecycle of services from inception and design through deployment, operation, and refinement.
- Provide feedback to internal teams by opening bugs, documenting workarounds, and suggesting improvements.
- Introduce new products to the Drive Nets sales and support teams and to Drive Nets customers.
- Deliver technical training sessions and TOIs for support/sales engineers, partners, and customers.
- Collaborate on product definition through customer requirement gathering and roadmap planning.
What we need to see:
- 5+ years of experience deploying and administering AI/HPC clusters or general‑purpose compute systems.
- 5+ years of hands‑on Linux experience (e.g., RHEL, CentOS, Ubuntu) and production infrastructure support (e.g., networking, storage, monitoring, compute, installation, configuration, maintenance, upgrade, retirement).
- Proficiency in Cloud, Virtualization, and Container technologies.
- Deep understanding of operating systems, computer…