Principal Solutions Architect
Listed on 2026-05-30
IT/Tech
Systems Engineer, Data Engineering
Overview
We are seeking an elite Solutions Architect to lead the end-to-end design, sizing, and deployment of NVIDIA AI Factory-aligned infrastructure. In this highly technical, customer-facing role you will translate complex AI and machine learning workload requirements into fully engineered infrastructure solutions spanning colocation facilities, GPU compute, high-performance networking, parallel storage, and the complete NVIDIA AI software stack.
You will serve as a trusted technical advisor to enterprise and hyperscale customers, partnering with sales, product, and engineering teams to win and deliver transformational AI infrastructure programs. Your expertise will directly shape how organizations build and operate production AI Factories capable of training frontier models, running large-scale inference fleets, and accelerating data science pipelines at scale.
Your Impact
Solution Design & Architecture
- Lead discovery workshops to capture AI/ML workload requirements, including model training scale, inference SLAs, data pipeline throughput, and multi-tenancy needs.
- Architect full-stack AI Factory solutions aligned to NVIDIA reference architectures, integrating colocation, GPU compute, networking, storage, and software layers.
- Develop detailed Bills of Materials (BOMs), rack elevation diagrams, network topology drawings, and power/cooling budgets for customer proposals.
- Define GPU cluster architectures using NVIDIA DGX, HGX, and MGX systems with B200, B300, and GB300 Blackwell SXM and NVLink-Switch configurations.
- Design RTX PRO 6000 Blackwell Server Edition deployments for inference-optimized and enterprise AI workloads.
- Conduct workload sizing and TCO/ROI modeling to validate infrastructure dimensioning for training, fine-tuning, and inference at scale.
- Specify colocation requirements including critical power load (MW-scale), UPS and generator configurations, and PUE targets.
- Design high-density GPU deployments utilizing air-cooled, direct liquid cooling (DLC), and rear-door heat exchanger configurations.
- Define meet-me room (MMR) and cross-connect requirements; specify carrier-neutral telecom diversity strategies.
- Engage colocation providers and data center operators to validate capacity availability and negotiate technical SLAs.
- Coordinate with facilities and MEP engineers to validate power infrastructure from utility feed through PDU to rack level.
- Architect multi-node GPU clusters optimized for large language model (LLM) pre-training, fine-tuning, and reinforcement learning from human feedback (RLHF).
- Size and configure DGX SuperPOD, HGX H/B-series, and MGX modular systems based on model parameter count, dataset size, and iteration timelines.
- Define server firmware, BIOS, BMC, and DGX OS baselines for production GPU infrastructure.
- Establish GPU health monitoring, RAS (Reliability, Availability, Serviceability) policies, and lifecycle management procedures.
- Design backend GPU fabric networks using NVIDIA Quantum InfiniBand (NDR 400 Gb/s and HDR 200 Gb/s) for distributed training traffic.
- Architect Spectrum-X Ethernet-based AI networking solutions for inference clusters requiring high-bandwidth, low-latency connectivity.
- Specify ConnectX-8/7 HCA deployments and configure RDMA over Converged Ethernet (RoCEv2) or InfiniBand transport for NCCL collective operations.
- Integrate BlueField-3 DPUs for GPU-accelerated network functions, storage offload, zero-trust security isolation, and bare-metal provisioning.
- Design leaf-spine and fat-tree topologies for non-blocking bisection bandwidth in GPU training clusters.
- Define Quality of Service (QoS) policies separating storage, compute fabric, and management plane traffic.
- Design high-performance parallel file system solutions using VAST Data, Hammerspace, and Pure Storage FlashBlade//E for AI training and checkpoint storage.
- Size storage capacity, IOPS, and throughput based on dataset characteristics, checkpoint frequency, and concurrent reader/writer counts.
- Architect multi-tier storage hierarchies: hot NVMe flash (VAST/FlashBlade) for active datasets, warm object storage for model archives, and cold tape/cloud for long-term retention.
- Configure VAST Data Universal Storage for disaggregated storage with NFS, S3, and POSIX access; tune for large sequential read performance.
- Deploy Hammerspace Global Data Environment for distributed data management and NFS-over-RDMA acceleration across geographically dispersed GPU clusters.
- Define data pipeline architectures that ingest from cloud object stores (S3, GCS, ABS) to local flash for GPU-local data loading without I/O bottlenecks.
- Deploy and configure the NVIDIA AI Enterprise (NVAIE) software stack, including NVIDIA GPU Operator, NIM microservices, and RAPIDS accelerated data science libraries.
- Architect inference serving infrastructure using NVIDIA NIM (NVIDIA Inference Microservices) for optimized LLM and…