Principal Solutions Architect; San Ramon, CA Job San Ramon area, California USA, IT/Tech

Position: Principal Solutions Architect (Req#1048) San Ramon, CA

Overview

We are seeking an elite Solutions Architect to lead the end-to-end design, sizing, and deployment of NVIDIA AI Factory‑aligned infrastructure. In this highly technical, customer‑facing role you will translate complex AI and machine learning workload requirements into fully engineered infrastructure solutions spanning colocation facilities, GPU compute, high‑performance networking, parallel storage, and the complete NVIDIA AI software stack.

You will serve as a trusted technical advisor to enterprise and hyperscale customers, partnering with sales, product, and engineering teams to win and deliver transformational AI infrastructure programs. Your expertise will directly shape how organizations build and operate production AI Factories capable of training frontier models, running large‑scale inference fleets, and accelerating data science pipelines at scale.

Your Impact

Lead discovery workshops to capture AI/ML workload requirements, including model training scale, inference SLAs, data pipeline throughput, and multi‑tenancy needs.
Architect full‑stack AI Factory solutions aligned to NVIDIA reference architectures, integrating colocation, GPU compute, networking, storage, and software layers.
Develop detailed Bills of Materials (BOMs), rack elevation diagrams, network topology drawings, and power/cooling budgets for customer proposals.
Define GPU cluster architectures using NVIDIA DGX, HGX, and MGX systems with B200, B300, and GB300 Blackwell SXM and NVLink‑Switch configurations.
Design RTX PRO 6000 Blackwell Server Edition deployments for inference‑optimized and enterprise AI workloads.
Conduct workload sizing and TCO/ROI modeling to validate infrastructure dimensioning for training, fine‑tuning, and inference at scale.

Colocation & Facility Planning

Specify colocation requirements including critical power load (MW‑scale), UPS and generator configurations, and PUE targets.
Design high‑density GPU deployments utilizing air‑cooled, direct liquid cooling (DLC), and rear‑door heat exchanger configurations.
Define meet‑me room (MMR) and cross‑connect requirements; specify carrier‑neutral telecom diversity strategies.
Engage colocation providers and data center operators to validate capacity availability and negotiate technical SLAs.
Coordinate with facilities and MEP engineers to validate power infrastructure from utility feed through PDU to rack level.

GPU Compute Infrastructure

Architect multi‑node GPU clusters optimized for large language model (LLM) pre‑training, fine‑tuning, and reinforcement learning from human feedback (RLHF).
Size and configure DGX SuperPOD, HGX H/B‑series, and MGX modular systems based on model parameter count, dataset size, and iteration timelines.
Define server firmware, BIOS, BMC, and DGX OS baselines for production GPU infrastructure.
Establish GPU health monitoring, RAS (Reliability, Availability, Serviceability) policies, and lifecycle management procedures.

High‑Performance Networking

Design backend GPU fabric networks using NVIDIA Quantum InfiniBand (NDR 400 Gb/s and HDR 200 Gb/s) for distributed training traffic.
Architect Spectrum‑X Ethernet‑based AI networking solutions for inference clusters requiring high‑bandwidth, low‑latency connectivity.
Specify ConnectX‑8/7 HCA deployments and configure RDMA over Converged Ethernet (RoCEv2) or InfiniBand transport for NCCL collective operations.
Integrate BlueField‑3 DPUs for GPU‑accelerated network functions, storage offload, zero‑trust security isolation, and bare‑metal provisioning.
Design leaf‑spine and fat‑tree topologies that deliver non‑blocking bisection bandwidth in GPU training clusters.
Define Quality of Service (QoS) policies separating storage, compute fabric, and management plane traffic.
Design high‑performance parallel file system solutions using VAST Data, Hammerspace, and Pure Storage FlashBlade//E for AI training and checkpoint storage.
Size storage capacity, IOPS, and throughput based on dataset characteristics, checkpoint frequency, and concurrent reader/writer counts.
Architect multi‑tier storage hierarchies: hot NVMe flash (VAST/FlashBlade) for active datasets, warm…
