
Senior ML Infrastructure Engineer; PyTorch, Kubernetes, GPU Training

Job in Redwood City, San Mateo County, California, 94061, USA
Listing for: Finoit Inc
Full Time, Apprenticeship/Internship position
Listed on 2026-07-04
Job specializations:
  • Software Development
    Machine Learning/ML Engineer, Data Engineering
Salary/Wage Range or Industry Benchmark: 120000 - 160000 USD Yearly
Job Description & How to Apply Below
Position: Senior ML Infrastructure Engineer (PyTorch, Kubernetes, GPU Training)

Job Description

We are seeking a Senior ML Infrastructure Engineer to design and scale the infrastructure powering large-scale machine learning training workloads. In this role, you'll build high-performance GPU training platforms, optimize distributed training pipelines, and improve the developer experience for ML researchers.

Responsibilities
  • Design and scale distributed ML training infrastructure for large GPU clusters.
  • Build and optimize training pipelines using PyTorch, DeepSpeed, and distributed training frameworks.
  • Develop and maintain job scheduling systems using Kubernetes and/or SLURM.
  • Create high-throughput data pipelines for large-scale multimodal datasets.
  • Optimize GPU utilization, memory efficiency, and overall system performance.
  • Build low-latency inference pipelines for production ML deployments.

Required Skills
  • 7+ years of experience in ML Infrastructure, HPC, or Distributed Systems.
  • Strong experience with PyTorch, DeepSpeed, FSDP, ZeRO, or similar distributed training frameworks.
  • Hands-on experience with Kubernetes, cloud platforms (AWS/GCP), and containerized environments.
  • Strong understanding of distributed systems, GPU optimization, NCCL, memory management, and performance tuning.
  • Experience building scalable ML infrastructure from development through production.

Location: Redwood City, CA (On-site)
Employment Type: Full-Time

Nice to Have
  • Experience with multimodal AI, robotics data pipelines, Triton, TensorRT, custom ML kernels, or ML compiler/runtime optimization.
Position Requirements
10+ Years work experience