Senior Software Engineer - Distributed Training
Location: Palo Alto, Santa Clara County, California, 94306, USA
Listed on: 2026-02-17
Listed by: Clockwork Systems
Position type: Apprenticeship/Internship
Job specializations:
- IT/Tech: Systems Engineer, AI Engineer
Job Description & How to Apply Below
Clockwork.io - Software Driven Fabrics to increase GPU cluster utilization
Clockwork Systems was founded by Stanford researchers and veteran systems engineers who share a vision for redefining the foundations of distributed computing. As AI workloads grow increasingly complex, traditional infrastructure struggles to meet the demands of performance, reliability, and precise coordination. Clockwork is pioneering a software-driven approach to AI fabrics by delivering cross-stack observability to catch and quickly resolve problems, workload fault tolerance to keep jobs running through failures, and performance acceleration that dynamically routes and paces traffic to avoid congestion.
To learn more, visit Clockwork.io (use the "Apply for this Job" box below).
About the Role
We are looking for an experienced software engineer to help build, optimize, and maintain large-scale distributed training infrastructure based on the PyTorch ecosystem. This role focuses on production-grade training workflows involving multi-GPU and multi-node orchestration, high-performance communication layers, and advanced parallelism strategies.
You'll work alongside infrastructure and machine learning teams to ensure training jobs are efficient, scalable, and resilient.
What You Will Do
* Develop and support distributed PyTorch training jobs using torch.distributed / c10d
* Integrate and maintain frameworks like Megatron-LM, DeepSpeed, and related LLM training stacks
* Diagnose and resolve distributed training issues (e.g., NCCL hangs, OOM, checkpoint corruption)
* Optimize performance across communication, I/O, and memory bottlenecks
* Implement fault tolerance, checkpointing, and recovery mechanisms for long-running jobs
* Write tooling and scripts to streamline training workflows and experiment management
* Collaborate with ML engineers to ensure compatibility with orchestration and container environments (e.g., Slurm, Kubernetes)
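The fault-tolerance and recovery responsibilities above follow a common pattern: checkpoint atomically, then resume from the last checkpoint after a failure. As an illustrative sketch only (the function names and JSON format are assumptions, not Clockwork's stack; a real training job would persist model and optimizer state dicts via torch.save), the resume/restart logic looks roughly like this:

```python
import json
import os
import tempfile


def save_checkpoint(path, step, state):
    """Write a checkpoint atomically: dump to a temp file in the same
    directory, then rename over the target, so a crash mid-write never
    leaves a truncated checkpoint behind."""
    payload = {"step": step, "state": state}
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
        os.replace(tmp, path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp)
        raise


def load_checkpoint(path):
    """Return (step, state) from the last checkpoint, or (0, {}) to
    start fresh when no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0, {}
    with open(path) as f:
        payload = json.load(f)
    return payload["step"], payload["state"]


def train(path, total_steps, checkpoint_every=10):
    """Toy training loop that resumes from the last checkpoint and
    checkpoints periodically so a restart loses at most one interval."""
    step, state = load_checkpoint(path)
    while step < total_steps:
        state["loss"] = 1.0 / (step + 1)  # stand-in for a real training step
        step += 1
        if step % checkpoint_every == 0 or step == total_steps:
            save_checkpoint(path, step, state)
    return step, state
```

The atomic-rename trick is the load-bearing detail: writing the checkpoint in place would mean a node failure during the write corrupts the only copy, which is one common source of the "checkpoint corruption" issues mentioned above.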
What We're Looking For
* Deep experience with PyTorch and torch.distributed (c10d)
* Hands-on experience with at least one of: Megatron-LM, DeepSpeed, or FairScale
* Proficiency in Python and Linux shell scripting
* Experience with multi-node GPU clusters using Slurm, Kubernetes, or similar
* Strong understanding of NCCL, collective communication, and GPU topology
* Familiarity with debugging tools and techniques for distributed systems
Preferred Skills
* Experience scaling LLM training across 8+ GPUs and multiple nodes
* Knowledge of tensor, pipeline, and data parallelism
* Familiarity with containerized training environments (Docker, Singularity)
* Exposure to HPC environments or cloud GPU infrastructure
* Experience with training workload orchestration tools or custom job launchers
* Comfort with large-scale checkpointing, resume/restart logic, and model I/O
Bonus Skills
* Profiling tools: PyTorch Profiler, Nsight, nvprof, or equivalent
* Experience with performance tuning in distributed training environments
* Contributions to ML infrastructure open-source projects
* Familiarity with storage, networking, or RDMA/GPUDirect technologies
* Understanding of observability in ML pipelines (metrics, logs, dashboards)
Enjoy
* Challenging projects.
* A friendly and inclusive workplace culture.
* Competitive compensation.
* A great benefits package.
* Catered lunch.
Compensation for this position will vary based on the skills and experience you bring, as well as internal equity considerations. For candidates hired at the posted level, the expected base salary range is $150,000 - $230,000. The offered compensation package may also include stock options or other equity awards, subject to Clockwork's equity program and applicable approvals.
Clockwork Systems is an equal opportunity employer. We are committed to building world-class teams by welcoming bright, passionate individuals from all backgrounds. All qualified applicants will receive consideration for employment without regard to race, color, ancestry, religion, age, sex, sexual orientation, gender identity or expression, national origin, disability, or protected veteran status. We believe diversity drives innovation, and we grow stronger together.
Position Requirements
10+ years of work experience