LLM Pre-training & Distributed Systems Engineer; AI Infrastructure
Job in Oregon, Dane County, Wisconsin, 53575, USA
Listed on 2026-04-28
Listing for: Hyphen Connect
Apprenticeship/Internship position
Job specializations:
- Engineering
- AI Engineer
Job Description & How to Apply Below
We are seeking a highly skilled LLM Pre-training & Distributed Systems Engineer. This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure. The ideal candidate will have a deep understanding of GPU clusters and extensive experience in systems engineering to ensure efficient and reliable training processes.
Responsibilities:
- Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
- Optimize networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors.
- Automate checkpointing and failure recovery during month-long training runs.
Requirements:
- Deep expertise in 3D parallelism (data, tensor, pipeline).
- Strong systems engineering background (C++, CUDA, Python).