Senior Systems Engineer - Network Infrastructure
Listed on 2026-04-28
IT/Tech
Systems Engineer, Cloud Computing, Network Engineer, Data Engineer
Overview
We are building next-generation AI infrastructure from the ground up. Our mission is to deliver highly performant, reliable, and scalable network clusters purpose-built for large-scale AI training and inference.
The Role
We are hiring a Senior Deployment Engineer to lead hands-on bringup of network clusters across our data center environments. You will own the execution of node, rack, and network deployment, ensuring clusters are validated, performant, and production-ready. This role is deeply technical and execution-focused. You will be in the details (cabling racks, validating firmware, tuning fabrics, debugging performance) and will help us build repeatable processes as we scale.
What You’ll Do
- Execute end-to-end bringup of network nodes and racks from installation to production readiness.
- Validate BIOS/BMC/firmware configurations and network health.
- Perform rack-level integration including power, cabling, and airflow validation.
- Bring up and validate high-speed network fabrics (InfiniBand, RoCE, 100–400G Ethernet).
- Configure and validate leaf/spine network connectivity.
- Run cluster-wide burn-in and stress testing.
- Validate node-to-node performance (NCCL, RDMA, GPUDirect).
- Troubleshoot hardware, firmware, and fabric-level issues.
- Contribute to automation for provisioning and cluster validation.
- Improve deployment playbooks and documentation.
- Identify reliability issues early and drive corrective actions.
- Help turn ad hoc deployments into repeatable systems.
- Work closely with networking, systems software, and data center teams.
- Coordinate with hardware vendors to resolve bringup issues.
- Support rapid capacity expansion as we scale.
Required
- 5–8+ years in infrastructure engineering, hardware deployment, or data center operations.
- Hands-on experience deploying network servers (HGX/DGX or similar platforms).
- Experience with high-speed networking (InfiniBand, RoCE, Ethernet fabrics).
- Strong Linux systems knowledge.
- Experience troubleshooting distributed systems performance issues.
- Comfortable working onsite in data center environments as needed.
Strongly Preferred
- Experience in AI/ML infrastructure or HPC environments.
- Familiarity with NCCL, CUDA, RDMA.
- Automation experience (Python, Ansible, Terraform, Bash).
- Experience in high-density power and cooling environments.
What Success Looks Like
- Clusters are brought online quickly and correctly.
- Performance baselines meet or exceed expectations.
- Deployment processes become faster and more reliable over time.
- You help build the foundation for scaled infrastructure growth.
For information on how Nscale handles candidate personal data, please see our Employee & Candidate Privacy Notice.