Platform Architect Job, San Francisco area, California, USA, IT/Tech

Position: Platform Support Architect

Overview

This is an incredible opportunity to be part of a company that has been at the forefront of AI and high-performance data storage innovation for over two decades. Data Direct Networks (DDN) is a global market leader renowned for powering many of the world's most demanding AI data centers, in industries ranging from life sciences and healthcare to financial services, autonomous cars, government, academia, research, and manufacturing.

“DDN's A3I solutions are transforming the landscape of AI infrastructure.” – IDC

“The real differentiator is DDN. I never hesitate to recommend DDN. DDN is the de facto name for AI Storage in high performance environments” - Marc Hamilton, VP, Solutions Architecture & Engineering | NVIDIA

DDN is the global leader in AI and multi-cloud data management. Our cutting‑edge data intelligence platform is designed to accelerate AI workloads, enabling organizations to extract maximum value from their data. With a proven track record of performance, reliability, and scalability, DDN empowers businesses to tackle the most challenging AI and data‑intensive workloads with confidence.

Our success is driven by our unwavering commitment to innovation, customer‑centricity, and a team of passionate professionals who bring their expertise and dedication to every project. This is a chance to make a significant impact at a company that is shaping the future of AI and data management.

Our commitment to innovation, customer success, and market leadership makes this an exciting and rewarding role for a driven professional looking to make a lasting impact in the world of AI and data storage.

Job Description

DDN is expanding our Enterprise and Sovereign AI Solution offerings, for example HyperPOD – a turnkey NVIDIA AI Data Platform built on DDN Infinia storage, NVIDIA AI Enterprise (NVAIE), and Supermicro reference hardware, optimized for inference and RAG workloads. Our support organization is deep on storage (Infinia, EXAScaler); we are now hiring an AI platform specialist to lead supportability and enablement for the AI side of the stack – NVIDIA AI Enterprise services (NIMs, NeMo, Triton, GPU Operator, licensing), vector databases (initially Milvus), RAG/agentic workflows, and the high‑performance storage and networking fabric that underpins them.

You will be a trusted technical advisor within Support and across OEM and NVIDIA partner teams, combining the mindset of a solutions architect (architecture, reference patterns, PoCs, reusable assets) with that of an L3 support engineer. You’ll help DDN and our partners operate AI Data solutions as a cohesive AI platform, not just a collection of components.

Key Responsibilities

Platform support

Act as the primary NVIDIA AI Enterprise and vector database solutions expert for HyperPOD customer environments, bringing deep knowledge of NVAIE services (e.g., NIMs, NeMo, Triton, TensorRT/TensorRT‑LLM, GPU Operator, licensing/NLS) and vector databases (e.g., Milvus) to guide diagnosis, optimization, and solution design.
Own complex end‑to‑end triage across GPU, NVAIE services, vector DB, Kubernetes, Docker, high‑speed networking, and Infinia storage, distinguishing product defects from environmental and integration issues.
Diagnose and resolve performance bottlenecks in RAG and agentic AI workflows, from model selection and prompt/RAG configuration through to vector search, GPU utilization, and data access patterns.
Collect and interpret logs and telemetry across Linux, containers, Kubernetes, GPU stack, vector DB, and storage/networking; build minimal repros and high‑quality defect reports for escalation to NVIDIA, vector‑DB vendors, OEMs, and internal engineering.

Runbooks, diagnostics, and supportability

Author and maintain support triage runbooks and checklists for HyperPOD covering NVAIE services, Milvus/vector DB, GPU stack, Docker, Kubernetes resources, and their interaction with Infinia and the network fabric.
Define and validate unified diagnostics bundles that capture the right logs/configs/metrics from all relevant layers (Infinia, GPUs, NVAIE, Milvus, Kubernetes, network) to enable fast problem isolation and high‑signal escalations.
Collaborate…