
LLM Pre-training & Distributed Engineer (AI Infrastructure)

Job in San Francisco, San Francisco County, California, 94199, USA
Listing for: Hyphen Connect Limited
Apprenticeship/Internship position
Listed on 2026-07-03
Job specializations:
  • Software Development
    Machine Learning/ML Engineer, AI Engineer (Applied/Software)
Salary/Wage Range or Industry Benchmark: USD 120,000 - 160,000 per year
Job Description & How to Apply Below
Position: LLM Pre-training & Distributed Engineer (AI Infrastructure)

We are seeking a highly skilled LLM Pre-training & Distributed Systems Engineer. This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure. The ideal candidate will have a deep understanding of GPU clusters and extensive systems engineering experience to keep training runs efficient and reliable.

Responsibilities
  • Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
  • Optimize networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors.
  • Automate checkpointing and failure recovery during month-long training runs.

Required Skills
  • Deep expertise in 3D parallelism (Data, Tensor, Pipeline).
  • Strong systems engineering background (C++, CUDA, Python).