Staff Engineer, Distributed Storage and HPC & AI Infrastructure Job: San Francisco area, California, USA (IT/Tech)

About the Role

In this role, you will design and deliver multi-petabyte storage systems purpose-built for the world's largest AI training and inference workloads. You'll architect high-performance parallel file systems and object stores, evaluate and integrate cutting-edge technologies such as WekaFS, Ceph, and Lustre, and drive aggressive cost optimization, routinely achieving 30-50% savings through intelligent tiering, lifecycle policies, capacity forecasting, and right-sizing.

You will also build Kubernetes-native storage operators and self-service platforms that provide automated provisioning, strict multi-tenancy, performance isolation, and quota enforcement at cluster scale. Day-to-day, you'll optimize end-to-end data paths for 10-50 GB/s per node, design multi-tier caching architectures, implement intelligent prefetching and model-weight distribution, and tune parallel file systems for AI workloads.

Responsibilities

* Design multi-petabyte AI/ML storage systems; integrate WekaFS, Ceph, etc.; lead capacity planning and cost optimization (30-50% savings via tiering, lifecycle policies, right-sizing).

* Design/optimize RDMA, InfiniBand, and 400 GbE networks; tune for max throughput/min latency; implement NVMe-oF/iSCSI; troubleshoot bottlenecks; optimize TCP/IP for storage.

* Build Kubernetes storage operators/controllers; enable automated provisioning, self-service abstractions, multi-tenant isolation, quotas; create reusable Helm/Terraform patterns.

* Deliver 10-50 GB/s per GPU node; optimize caching (weights/datasets/checkpoints), parallel file systems, and data paths; troubleshoot with profiling tools; scale to thousands of nodes.

* Build multi-tier caches (local NVMe, distributed, object); optimize data locality and model-weight distribution; implement smart prefetching/eviction.

* Implement monitoring, alerting, SLOs; design DR/backups with runbooks; run chaos engineering; ensure 99.9%+ uptime via proactive/automated remediation.

* Partner with ML/SRE teams; mentor on storage best practices; contribute to open-source; write docs, postmortems, and public learnings.

Requirements

* 8+ years in storage engineering with 3+ years managing distributed storage at multi-petabyte scale

* Proven track record deploying and operating high-performance storage for GPU/HPC clusters

* Deep Kubernetes and cloud-native storage experience in production environments

* Strong coding skills in Go and Python with demonstrated ability to build production-grade tools

* BS/MS in Computer Science, Engineering, or equivalent practical experience

* History of technical leadership: designing systems that significantly improved performance (3x), reliability (99.9%+ uptime), or cost efficiency

* Distributed Storage Systems: Deep expertise in WekaFS, Lustre, GPFS, BeeGFS, or similar parallel file systems at multi-petabyte scale

* Object Storage: Production experience with S3, MinIO, Ceph, or R2, including performance optimization and cost management

* Kubernetes Storage: CSI drivers, StatefulSets, PersistentVolumes, storage operators, and custom controllers

* Performance: Storage optimization for GPU workloads, RDMA/InfiniBand networking, parallel file system optimization (100+ GB/s aggregate cluster throughput)

* Programming: Go and Python for automation, operators, and tooling

* Infrastructure as Code: Terraform, Ansible, Helm, GitOps (ArgoCD)

* Linux Storage Stack: Advanced knowledge of file systems (ext4, XFS), LVM, NVMe optimization, RAID configurations

* Observability: Prometheus, Grafana, Thanos architecture and operations

Nice to Have Skills

* GPUDirect Storage (GDS), NVMe-oF, storage networking (100 GbE/400 GbE)

* ML/AI storage patterns (model weights, checkpointing, dataset caching)

* Kubernetes operator development (controller-runtime, kubebuilder)

* Storage snapshots, cloning, and thin provisioning

* Backup and disaster recovery (Velero, Restic, cross-region replication)

* Storage encryption (at-rest and in-transit), security and compliance

* Storage benchmarking and profiling tools (fio, iperf3, iostat, blktrace)

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive…
