Principal Solutions Architect; San Ramon, CA
Listed on 2026-06-05
IT/Tech
Systems Engineer, Data Engineer
Principal Solutions Architect (Req#1048)
San Ramon, CA
Overview
We are seeking an elite Solutions Architect to lead the end-to-end design, sizing, and deployment of NVIDIA AI Factory‑aligned infrastructure. In this highly technical, customer‑facing role you will translate complex AI and machine learning workload requirements into fully engineered infrastructure solutions spanning colocation facilities, GPU compute, high‑performance networking, parallel storage, and the complete NVIDIA AI software stack.
You will serve as a trusted technical advisor to enterprise and hyperscale customers, partnering with sales, product, and engineering teams to win and deliver transformational AI infrastructure programs. Your expertise will directly shape how organizations build and operate production AI Factories capable of training frontier models, running large‑scale inference fleets, and accelerating data science pipelines at scale.
Your Impact
- Lead discovery workshops to capture AI/ML workload requirements, including model training scale, inference SLAs, data pipeline throughput, and multi‑tenancy needs.
- Architect full‑stack AI Factory solutions aligned to NVIDIA reference architectures, integrating colocation, GPU compute, networking, storage, and software layers.
- Develop detailed Bills of Materials (BOMs), rack elevation diagrams, network topology drawings, and power/cooling budgets for customer proposals.
- Define GPU cluster architectures using NVIDIA DGX, HGX, and MGX systems with B200, B300, and GB300 Blackwell SXM and NVLink‑Switch configurations.
- Design RTX PRO 6000 Blackwell Server Edition deployments for inference‑optimized and enterprise AI workloads.
- Conduct workload sizing and TCO/ROI modeling to validate infrastructure dimensioning for training, fine‑tuning, and inference at scale.
- Specify colocation requirements including critical power load (MW‑scale), UPS and generator configurations, and PUE targets.
- Design high‑density GPU deployments utilizing air‑cooled, direct liquid cooling (DLC), and rear‑door heat exchanger configurations.
- Define meet‑me room (MMR) and cross‑connect requirements; specify carrier‑neutral telecom diversity strategies.
- Engage colocation providers and data center operators to validate capacity availability and negotiate technical SLAs.
- Coordinate with facilities and MEP engineers to validate power infrastructure from utility feed through PDU to rack level.
- Architect multi‑node GPU clusters optimized for large language model (LLM) pre‑training, fine‑tuning, and reinforcement learning from human feedback (RLHF).
- Size and configure DGX SuperPOD, HGX H/B‑series, and MGX modular systems based on model parameter count, dataset size, and iteration timelines.
- Define server firmware, BIOS, BMC, and DGX OS baselines for production GPU infrastructure.
- Establish GPU health monitoring, RAS (Reliability, Availability, Serviceability) policies, and lifecycle management procedures.
- Design backend GPU fabric networks using NVIDIA Quantum InfiniBand (NDR 400 Gb/s and HDR 200 Gb/s) for distributed training traffic.
- Architect Spectrum‑X Ethernet‑based AI networking solutions for inference clusters requiring high‑bandwidth, low‑latency connectivity.
- Specify ConnectX‑8/7 HCA deployments and configure RDMA over Converged Ethernet (RoCEv2) or InfiniBand transport for NCCL collective operations.
- Integrate BlueField‑3 DPUs for GPU‑accelerated network functions, storage offload, zero‑trust security isolation, and bare‑metal provisioning.
- Design leaf‑spine and fat‑tree topologies for non‑blocking bisection bandwidth in GPU training clusters.
- Define Quality of Service (QoS) policies separating storage, compute fabric, and management plane traffic.
- Design high‑performance parallel file system solutions using VAST Data, Hammerspace, and Pure Storage FlashBlade//E for AI training and checkpoint storage.
- Size storage capacity, IOPS, and throughput based on dataset characteristics, checkpoint frequency, and concurrent reader/writer counts.
- Architect multi‑tier storage hierarchies: hot NVMe flash (VAST/FlashBlade) for active datasets, warm…