Senior ML Engineer — Distributed LLM Training Infrastructure
Listed on 2026-01-07
IT/Tech
AI Engineer, Machine Learning / ML Engineer
About Templar
Templar is at the forefront of community-driven AI development, redefining how large language models (LLMs) are trained. Our team enables permissionless pretraining, allowing collaborators across diverse computational environments to jointly train LLMs without centralized coordination. Our latest research introduces Gauntlet, an incentive system deployed on-chain that powered a truly decentralized 1.2B-parameter LLM training run. You can read the paper, Incentivizing Permissionless Distributed Learning of LLMs, on arXiv.
We’re looking for a seasoned Senior ML Engineer to architect and scale the infrastructure that enables distributed LLM training. You will design robust systems atop existing frameworks, extend permissionless protocols, and optimize for decentralized environments across heterogeneous hardware.
Responsibilities
Distributed Training Infrastructure
- Architect scalable training across frameworks such as TorchTitan, Megatron-LM, DeepSpeed, and FairScale
- Implement model, data, and pipeline parallelism with efficient gradient synchronization and all-reduce in heterogeneous clusters (see the sketch after this list)
- Build fault-tolerant systems including checkpointing and node-failure recovery
- Optimize memory usage and GPU operations with custom CUDA kernels
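To make the gradient-synchronization and checkpointing items concrete, here is a minimal sketch, assuming a PyTorch process group over NCCL with one process per GPU launched via torchrun; the model, data loader, and checkpoint path are placeholders, not anything from Templar's stack.

```python
# Minimal sketch: data-parallel gradient all-reduce with periodic checkpointing.
# Assumes torchrun sets the rendezvous environment variables.
import torch
import torch.distributed as dist

def train(model, loader, optimizer, ckpt_path="ckpt.pt", ckpt_every=100):
    dist.init_process_group(backend="nccl")          # NCCL for GPU collectives
    rank = dist.get_rank()
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
    model.to(device)

    for step, (x, y) in enumerate(loader):
        x, y = x.to(device), y.to(device)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()

        # Average gradients across all workers (what DDP does under the hood).
        world = dist.get_world_size()
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= world

        optimizer.step()

        # Rank 0 checkpoints periodically so a failed node can rejoin from
        # the last saved state instead of restarting the run.
        if rank == 0 and step % ckpt_every == 0:
            torch.save({"step": step,
                        "model": model.state_dict(),
                        "optimizer": optimizer.state_dict()}, ckpt_path)

    dist.destroy_process_group()
```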
Framework Development & Optimization
- Extend frameworks to enable multi‑party permissionless training
- Implement optimization features such as gradient compression, quantization, and sparsification (a top-k compression sketch follows this list)
- Build resilient communication backends for high‑latency, unreliable networks
- Develop resource managers, schedulers, and profiling tools for distributed training
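As an illustration of the gradient-compression item above, here is a sketch of top-k sparsification, one common scheme for cutting communication volume; compress_topk and decompress_topk are our own placeholder names, not APIs of any listed framework.

```python
# Illustrative top-k gradient sparsification: only the k largest-magnitude
# entries would be exchanged over the network instead of the dense gradient.
import torch

def compress_topk(grad: torch.Tensor, ratio: float = 0.01):
    """Keep the top `ratio` fraction of entries by magnitude."""
    flat = grad.reshape(-1)
    k = max(1, int(flat.numel() * ratio))
    _, idx = torch.topk(flat.abs(), k)
    return flat[idx], idx, grad.shape

def decompress_topk(values, idx, shape):
    """Scatter the kept values back into a dense tensor of the original shape."""
    flat = torch.zeros(torch.Size(shape).numel(),
                       dtype=values.dtype, device=values.device)
    flat[idx] = values
    return flat.reshape(shape)

# Usage: the (values, idx) pair is the small payload that would be shared,
# e.g. via dist.all_gather, in place of the full gradient tensor.
g = torch.randn(1024, 1024)
vals, idx, shape = compress_topk(g, ratio=0.01)
g_sparse = decompress_topk(vals, idx, shape)
```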
System Architecture & Scaling
- Architect full training pipelines (data ingestion ➜ deployment)
- Build containerized systems via Kubernetes/Docker across cloud platforms
- Design model sharding for 100B+ parameter models (see the FSDP sketch after this list)
- Implement CI/CD pipelines for distributed infrastructure
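For the model-sharding item, a minimal sketch using PyTorch FSDP with a size-based auto-wrap policy; MyTransformer and the 100M-parameter threshold are illustrative assumptions, and real 100B+ runs would layer tensor and pipeline parallelism on top of this.

```python
# Minimal sketch: shard a large model with FSDP so parameters, gradients,
# and optimizer state are split across ranks.
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

def shard_model(model: torch.nn.Module) -> FSDP:
    # Wrap any submodule above ~100M parameters as its own FSDP unit so only
    # one unit's full parameters are materialized on a GPU at a time.
    policy = functools.partial(size_based_auto_wrap_policy,
                               min_num_params=100_000_000)
    return FSDP(model,
                auto_wrap_policy=policy,
                device_id=torch.cuda.current_device())

# Usage (after dist.init_process_group("nccl") under torchrun):
#   sharded = shard_model(MyTransformer())   # MyTransformer is a placeholder
#   loss = sharded(batch).sum(); loss.backward(); optimizer.step()
```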
Performance Engineering
- Profile and optimize throughput, memory, communication patterns
- Leverage mixed-precision training, gradient accumulation, and fused kernels (a minimal loop sketch follows this list)
- Build benchmarking and performance regression suites
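Here is a minimal sketch combining mixed precision with gradient accumulation, two of the techniques above; the model, data loader, and accumulation factor are placeholders.

```python
# Minimal sketch: autocast + GradScaler with gradient accumulation to emulate
# a larger effective batch size without extra memory.
import torch

def train_amp(model, loader, optimizer, accum_steps=8):
    scaler = torch.cuda.amp.GradScaler()
    model.train()
    optimizer.zero_grad(set_to_none=True)

    for step, (x, y) in enumerate(loader):
        x, y = x.cuda(), y.cuda()
        # Run forward/backward in reduced precision where it is numerically safe.
        with torch.autocast(device_type="cuda"):
            loss = torch.nn.functional.cross_entropy(model(x), y) / accum_steps
        scaler.scale(loss).backward()

        # Step only every `accum_steps` micro-batches.
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
```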
Qualifications
- Bachelor’s or Master’s in CS, Engineering, or a related field
- 5+ years of experience with large‑scale distributed systems / HPC
- Deep expertise with distributed LLM frameworks (TorchTitan, Megatron-LM, DeepSpeed, FairScale)
- Expert‑level PyTorch knowledge and hands‑on with DDP/FSDP/RPC
- Strong experience in Python and C++/CUDA systems programming
- Familiarity with Kubernetes, Docker, and cloud platforms (AWS/GCP/Azure)
- Proven track record scaling ML training workloads efficiently
- Experience training models >10B parameters via model parallelism
- CUDA and GPU optimization for deep learning workloads
- Proficiency with NCCL, MPI, high‑throughput data pipelines
- Familiarity with decentralized systems, P2P, or blockchain technologies
- Contributions to open‑source ML/distributed training projects
What You’ll Work On
- Benchmark distributed frameworks for permissionless setups
- Build proof‑of‑concept infrastructure: gradient compression, node management, fault tolerance
- Develop performance benchmarking and observability tools
- Collaborate with research teams to transition algorithms to production
- Lead development of a next-generation distributed training platform supporting 1000+ participants
- Implement advanced CUDA/kernel optimizations
- Release SDKs/APIs for permissionless participation
- Establish performance benchmarks and open‑source infrastructure components
Your work will directly enable the democratization of LLM training, making large-scale models accessible across the globe regardless of access to centralized resources. You’ll be central to building the distributed systems that support permissionless AI innovation.