
Senior Storage System Engineer - Supercomputing

Job in Sunnyvale, Santa Clara County, California, 94087, USA
Listing for: Institute of Foundation Models
Full Time position
Listed on 2025-12-28
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, Data Engineer, AI Engineer
Salary/Wage Range or Industry Benchmark: 200,000 - 400,000 USD yearly
Job Description

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

As a Storage Systems Engineer on the IFM Supercomputing Team, you will design, build, and optimize high-performance storage systems to support some of the most advanced GPU supercomputing clusters in academia. These clusters power both AI training and inference workloads, requiring exceptional reliability, scalability, and low-latency data access.

Job Responsibilities
  • Architect and implement distributed and parallel file systems (e.g., Lustre, DDN, VAST) optimized for large-scale AI and HPC workloads.
  • Ensure seamless integration of storage with compute clusters managed by Slurm, Kubernetes and other orchestration systems.
  • Optimize I/O performance of parallel file systems for high-throughput, low-latency access using modern storage technologies (NVMe, SSD).
  • Collaborate with infrastructure teams to enhance deployment pipelines using Infrastructure-as-Code (IaC) tools, ensuring reproducibility and reliability.
  • Monitor and maintain storage systems across on-premises and hybrid environments, proactively addressing performance bottlenecks and system failures.
  • Contribute to capacity planning, fault tolerance, and data durability strategies aligned with IFM’s growing computational demands.
Tech Stack
  • Lustre or similar parallel file systems.
  • Ceph, ZFS, MinIO, S3, GCS, or similar distributed storage systems.
  • Slurm and Kubernetes, or similar schedulers.
  • Pulumi, Terraform, Ansible.
  • NVMe, SSD, HDD technologies.
Professional Experience
  • Proven experience designing and operating large-scale distributed or parallel storage systems (e.g., Lustre, DDN, VAST, Ceph, ZFS) in HPC or AI environments.
  • Strong familiarity with storage hardware (NVMe, SSD, HDD) and performance tuning in high-throughput, compute-intensive clusters.
  • Experience working with the Slurm and Kubernetes workload managers in production HPC environments.
  • Track record of working in large-scale supercomputing environments, ideally at national labs (e.g., LLNL, CSCS), top universities (e.g., Stanford), major tech firms (e.g., xAI, Meta, AWS), or enterprise vendors (e.g., NVIDIA, HPE, DDN).
  • Proficiency in developing storage-related tooling or monitoring solutions using Go or Rust.
  • Experience managing storage infrastructure via Infrastructure-as-Code (e.g., Terraform, Pulumi, Ansible).
  • Bonus: Familiarity with AI/ML data workflows and large-scale dataset handling.
Salary

$200,000 - $400,000 a year

Visa Sponsorship

This position is eligible for visa sponsorship.

Benefits Include
  • Comprehensive medical, dental, and vision benefits
  • Bonus
  • 401(k) plan
  • Generous paid time off, sick leave, and holidays
  • Paid parental leave
  • Employee Assistance Program
  • Life and disability insurance
Position Requirements
10+ years of work experience