
Senior ML Engineer — Distributed LLM Training Infrastructure

Job in Town of Poland, Jamestown, Chautauqua County, New York, 14701, USA
Listing for: Rethink recruit
Apprenticeship/Internship position
Listed on 2026-01-07
Job specializations:
  • IT/Tech
    AI Engineer, Machine Learning / ML Engineer
Salary/Wage Range or Industry Benchmark: 80,000 – 100,000 USD yearly
Job Description & How to Apply Below
Location: Town of Poland

About Templar

Templar is at the forefront of community-driven AI development, redefining how large language models (LLMs) are trained. Our team enables permissionless pretraining, allowing collaborators across diverse computational environments to jointly train LLMs without centralized coordination. Our latest research, Incentivizing Permissionless Distributed Learning of LLMs, introduces Gauntlet, an incentive system deployed on-chain that powered a truly decentralized 1.2B-parameter LLM training run. You can read the paper on arXiv: Incentivizing Permissionless Distributed Learning of LLMs.

Role Overview

We’re looking for a seasoned Senior ML Engineer to architect and scale the infrastructure that enables distributed LLM training. You will design robust systems atop existing frameworks, extend permissionless protocols, and optimize for decentralized environments across heterogeneous hardware.

Responsibilities

Distributed Training Infrastructure

  • Architect scalable training across frameworks like TorchTitan, Megatron-LM, DeepSpeed, FairScale
  • Implement model/data/pipeline parallelism, efficient gradient sync, all-reduce in heterogeneous clusters
  • Build fault‑tolerant systems including checkpointing and node‑failure recovery (see the sketch after this list)
  • Optimize memory usage and GPU operations with custom CUDA kernels
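
For illustration, a minimal sketch of fault-tolerant data-parallel training with PyTorch DDP follows. It assumes a torchrun launch on CUDA nodes; MODEL_DIM, CKPT_PATH, and the toy linear model are placeholders, not Templar's actual stack.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    MODEL_DIM = 1024            # placeholder model width
    CKPT_PATH = "/tmp/ckpt.pt"  # placeholder checkpoint location

    def main():
        # torchrun provides RANK, LOCAL_RANK, and WORLD_SIZE to every worker.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(MODEL_DIM, MODEL_DIM).cuda(local_rank)
        model = DDP(model, device_ids=[local_rank])  # gradient all-reduce handled by DDP
        optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

        # Node-failure recovery: resume from the latest checkpoint if one exists.
        start_step = 0
        if os.path.exists(CKPT_PATH):
            ckpt = torch.load(CKPT_PATH, map_location=f"cuda:{local_rank}")
            model.module.load_state_dict(ckpt["model"])
            optim.load_state_dict(ckpt["optim"])
            start_step = ckpt["step"] + 1

        for step in range(start_step, 1000):
            x = torch.randn(32, MODEL_DIM, device=local_rank)
            loss = model(x).pow(2).mean()   # dummy objective for the sketch
            optim.zero_grad()
            loss.backward()                 # gradients are all-reduced here
            optim.step()

            # Periodic checkpointing from rank 0 only.
            if step % 100 == 0 and dist.get_rank() == 0:
                torch.save({"model": model.module.state_dict(),
                            "optim": optim.state_dict(),
                            "step": step}, CKPT_PATH)

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

In practice, checkpoints would go to shared or replicated storage rather than a local /tmp path so a replacement node can resume.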

Framework Development & Optimization

  • Extend frameworks to enable multi‑party permissionless training
  • Implement optimization features like gradient compression, quantization, sparsification (see the sketch after this list)
  • Build resilient communication backends for high‑latency, unreliable networks
  • Develop resource managers, schedulers, and profiling tools for distributed training
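
For illustration, a toy sketch of top-k gradient sparsification, one of the compression techniques named above. Real systems exchange only the kept (values, indices) pairs between peers and usually add error feedback; the function names here are hypothetical.

    import torch

    def topk_compress(grad: torch.Tensor, ratio: float = 0.01):
        """Keep only the largest `ratio` fraction of gradient entries by magnitude."""
        flat = grad.reshape(-1)
        k = max(1, int(flat.numel() * ratio))
        _, indices = torch.topk(flat.abs(), k)
        return flat[indices], indices, grad.shape   # values keep their original sign

    def topk_decompress(values, indices, shape):
        """Rebuild a dense tensor that is zero everywhere except the kept entries."""
        flat = torch.zeros(shape.numel(), dtype=values.dtype, device=values.device)
        flat[indices] = values
        return flat.reshape(shape)

    # Example: compress a gradient to roughly 1% of its entries, then reconstruct.
    grad = torch.randn(1024, 1024)
    values, indices, shape = topk_compress(grad, ratio=0.01)
    restored = topk_decompress(values, indices, shape)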

System Architecture & Scaling

  • Architect full training pipelines (data ingestion ➜ deployment)
  • Build containerized systems via Kubernetes/Docker across cloud platforms
  • Design model sharding for 100B+ parameter spaces (see the sketch after this list)
  • Implement CI/CD pipelines for distributed infrastructure
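
As a sketch of the sharding bullet above, the snippet below wraps a toy transformer in PyTorch FSDP, which shards parameters, gradients, and optimizer state across ranks. It assumes a torchrun launch with NCCL; the small model stands in for a 100B+ parameter one.

    import os
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy stand-in for a large model; real deployments would also pass an
    # auto-wrap policy so each transformer block becomes its own shard unit.
    model = torch.nn.TransformerEncoder(
        torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
        num_layers=8,
    ).cuda(local_rank)

    sharded = FSDP(model)  # full parameters are gathered only when a layer needs them
    optim = torch.optim.AdamW(sharded.parameters(), lr=1e-4)

    x = torch.randn(4, 128, 1024, device=local_rank)
    loss = sharded(x).pow(2).mean()   # dummy objective for the sketch
    loss.backward()
    optim.step()
    dist.destroy_process_group()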

Performance Engineering

  • Profile and optimize throughput, memory, communication patterns
  • Leverage mixed‑precision, gradient accumulation, fused kernels (see the sketch after this list)
  • Build benchmarking and performance regression suites
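
For illustration, a minimal sketch of mixed-precision training with gradient accumulation in PyTorch. It assumes a single CUDA device; ACCUM_STEPS and the toy model are placeholders.

    import torch

    ACCUM_STEPS = 4  # micro-batches accumulated per optimizer update (placeholder)

    model = torch.nn.Linear(512, 512).cuda()
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()  # rescales fp16 gradients to avoid underflow

    for step in range(100):
        x = torch.randn(16, 512, device="cuda")
        # Run the forward pass in fp16 where numerically safe.
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = model(x).pow(2).mean() / ACCUM_STEPS  # scale loss for accumulation
        scaler.scale(loss).backward()   # gradients accumulate across micro-batches
        if (step + 1) % ACCUM_STEPS == 0:
            scaler.step(optim)          # unscales gradients, then applies the update
            scaler.update()
            optim.zero_grad(set_to_none=True)
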
Required Qualifications
  • Bachelor’s or Master’s in CS, Engineering, or related field
  • 5+ years of experience with large‑scale distributed systems / HPC
  • Deep expertise with distributed LLM frameworks (TorchTitan, Megatron‑LM, DeepSpeed, FairScale)
  • Expert‑level PyTorch knowledge and hands‑on with DDP/FSDP/RPC
  • Strong experience in Python and C++/CUDA systems programming
  • Familiarity with Kubernetes, Docker, and cloud platforms (AWS/GCP/Azure)
  • Proven track record scaling ML training workloads efficiently
Preferred Experience
  • Training models >10B parameters via model parallelism
  • CUDA and GPU optimization for deep learning workloads
  • Proficiency with NCCL, MPI, high‑throughput data pipelines
  • Familiarity with decentralized systems, P2P, or blockchain technologies
  • Contributions to open‑source ML/distributed training projects
Immediate Priorities (0–6 Months)
  • Benchmark distributed frameworks for permissionless setups
  • Build proof‑of‑concept infrastructure: gradient compression, node management, fault tolerance
  • Develop performance benchmarking and observability tools
  • Collaborate with research teams to transition algorithms to production
Longer‑Term (6+ Months)
  • Lead next‑gen distributed training platform support for 1000+ participants
  • Implement advanced CUDA/kernel optimizations
  • Release SDKs/APIs for permissionless participation
  • Establish performance benchmarks and open‑source infrastructure components
Why Join Templar?

Your work will directly enable the democratization of LLM training—making large‑scale models accessible across the globe, regardless of centralized resources. You’ll be central to building the distributed systems that support permissionless AI innovation.
