
Principal Solutions Architect; San Ramon, CA

Job in San Ramon, Contra Costa County, California, 94583, USA
Listing for: ePlus inc.
Full Time position
Listed on 2026-06-05
Job specializations:
  • IT/Tech
    Systems Engineer, Data Engineer
Salary/Wage Range or Industry Benchmark: 150,000 - 200,000 USD yearly
Job Description & How to Apply Below
Position: Principal Solutions Architect (Req#1048) San Ramon, CA

Overview

We are seeking an elite Solutions Architect to lead the end-to-end design, sizing, and deployment of NVIDIA AI Factory‑aligned infrastructure. In this highly technical, customer‑facing role you will translate complex AI and machine learning workload requirements into fully engineered infrastructure solutions spanning colocation facilities, GPU compute, high‑performance networking, parallel storage, and the complete NVIDIA AI software stack.

You will serve as a trusted technical advisor to enterprise and hyperscale customers, partnering with sales, product, and engineering teams to win and deliver transformational AI infrastructure programs. Your expertise will directly shape how organizations build and operate production AI Factories capable of training frontier models, running large‑scale inference fleets, and accelerating data science pipelines at scale.

Your Impact
  • Lead discovery workshops to capture AI/ML workload requirements, including model training scale, inference SLAs, data pipeline throughput, and multi‑tenancy needs.
  • Architect full‑stack AI Factory solutions aligned to NVIDIA reference architectures, integrating colocation, GPU compute, networking, storage, and software layers.
  • Develop detailed Bills of Materials (BOMs), rack elevation diagrams, network topology drawings, and power/cooling budgets for customer proposals.
  • Define GPU cluster architectures using NVIDIA DGX, HGX, and MGX systems with B200, B300, and GB300 Blackwell SXM and NVLink‑Switch configurations.
  • Design RTX PRO 6000 Blackwell Server Edition deployments for inference‑optimized and enterprise AI workloads.
  • Conduct workload sizing and TCO/ROI modeling to validate infrastructure dimensioning for training, fine‑tuning, and inference at scale.
Colocation & Facility Planning
  • Specify colocation requirements including critical power load (MW‑scale), UPS and generator configurations, and PUE targets.
  • Design high‑density GPU deployments utilizing air‑cooled, direct liquid cooling (DLC), and rear‑door heat exchanger configurations.
  • Define meet‑me room (MMR) and cross‑connect requirements; specify carrier‑neutral telecom diversity strategies.
  • Engage colocation providers and data center operators to validate capacity availability and negotiate technical SLAs.
  • Coordinate with facilities and MEP engineers to validate power infrastructure from utility feed through PDU to rack level.
GPU Compute Infrastructure
  • Architect multi‑node GPU clusters optimized for large language model (LLM) pre‑training, fine‑tuning, and reinforcement learning from human feedback (RLHF).
  • Size and configure DGX SuperPOD, HGX H/B‑series, and MGX modular systems based on model parameter count, dataset size, and iteration timelines.
  • Define server firmware, BIOS, BMC, and DGXOS baselines for production GPU infrastructure.
  • Establish GPU health monitoring, RAS (Reliability, Availability, Serviceability) policies, and lifecycle management procedures.
High‑Performance Networking
  • Design backend GPU fabric networks using NVIDIA Quantum InfiniBand (NDR 400 Gb/s and HDR 200 Gb/s) for distributed training traffic.
  • Architect Spectrum‑X Ethernet‑based AI networking solutions for inference clusters requiring high‑bandwidth, low‑latency connectivity.
  • Specify ConnectX‑8/7 HCA deployments and configure RDMA over Converged Ethernet (RoCEv2) or InfiniBand transport for NCCL collective operations.
  • Integrate BlueField‑3 DPUs for GPU‑accelerated network functions, storage offload, zero‑trust security isolation, and bare‑metal provisioning.
  • Design leaf‑spine and fat‑tree topologies for non‑blocking bisectional bandwidth in GPU training clusters.
  • Define Quality of Service (QoS) policies separating storage, compute fabric, and management plane traffic.
  • Design high‑performance parallel file system solutions using VAST Data, Hammerspace, and Pure Storage FlashBlade//E for AI training and checkpoint storage.
  • Size storage capacity, IOPS, and throughput based on dataset characteristics, checkpoint frequency, and concurrent reader/writer counts.
  • Architect multi‑tier storage hierarchies: hot NVMe flash (VAST/FlashBlade) for active datasets, warm…