Senior ML Engineer — Distributed LLM Training Infrastructure
Listed on 2026-01-07
IT/Tech
AI Engineer, Machine Learning / ML Engineer
About Templar
Templar is at the forefront of community-driven AI development, redefining how large language models (LLMs) are trained. Our team enables permissionless pretraining, allowing collaborators across diverse computational environments to jointly train LLMs without centralized coordination. Our latest research introduces Gauntlet, an incentive system deployed on-chain that powered a truly decentralized 1.2B-parameter LLM training run. You can read the paper, Incentivizing Permissionless Distributed Learning of LLMs, on arXiv.
We’re looking for a seasoned Senior ML Engineer to architect and scale the infrastructure that enables distributed LLM training. You will design robust systems atop existing frameworks, extend permissionless protocols, and optimize for decentralized environments across heterogeneous hardware.
Responsibilities
Distributed Training Infrastructure
- Architect scalable training across frameworks such as TorchTitan, Megatron-LM, DeepSpeed, and FairScale
- Implement model, data, and pipeline parallelism with efficient gradient synchronization and all-reduce in heterogeneous clusters (see the sketch after this list)
- Build fault-tolerant systems including checkpointing and node-failure recovery
- Optimize memory usage and GPU operations with custom CUDA kernels
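To make the gradient-synchronization and checkpointing items concrete, here is a minimal sketch, assuming a PyTorch process group over NCCL with one process per GPU launched via torchrun; the model, data loader, and checkpoint path are placeholders, not anything from Templar's stack.

```python
# Minimal sketch: data-parallel gradient all-reduce with periodic checkpointing.
# Assumes torchrun sets the rendezvous environment variables.
import torch
import torch.distributed as dist

def train(model, loader, optimizer, ckpt_path="ckpt.pt", ckpt_every=100):
    dist.init_process_group(backend="nccl")          # NCCL for GPU collectives
    rank = dist.get_rank()
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
    model.to(device)

    for step, (x, y) in enumerate(loader):
        x, y = x.to(device), y.to(device)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()

        # Average gradients across all workers (what DDP does under the hood).
        world = dist.get_world_size()
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= world

        optimizer.step()

        # Rank 0 checkpoints periodically so a failed node can rejoin from
        # the last saved state instead of restarting the run.
        if rank == 0 and step % ckpt_every == 0:
            torch.save({"step": step,
                        "model": model.state_dict(),
                        "optimizer": optimizer.state_dict()}, ckpt_path)

    dist.destroy_process_group()
```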
Framework Development & Optimization
- Extend frameworks to enable multi‑party permissionless training
- Implement optimization features such as gradient compression, quantization, and sparsification (a top-k compression sketch follows this list)
- Build resilient communication backends for high‑latency, unreliable networks
- Develop resource managers, schedulers, and profiling tools for distributed training
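As an illustration of the gradient-compression item above, here is a sketch of top-k sparsification, one common scheme for cutting communication volume; compress_topk and decompress_topk are our own placeholder names, not APIs of any listed framework.

```python
# Illustrative top-k gradient sparsification: only the k largest-magnitude
# entries would be exchanged over the network instead of the dense gradient.
import torch

def compress_topk(grad: torch.Tensor, ratio: float = 0.01):
    """Keep the top `ratio` fraction of entries by magnitude."""
    flat = grad.reshape(-1)
    k = max(1, int(flat.numel() * ratio))
    _, idx = torch.topk(flat.abs(), k)
    return flat[idx], idx, grad.shape

def decompress_topk(values, idx, shape):
    """Scatter the kept values back into a dense tensor of the original shape."""
    flat = torch.zeros(torch.Size(shape).numel(),
                       dtype=values.dtype, device=values.device)
    flat[idx] = values
    return flat.reshape(shape)

# Usage: the (values, idx) pair is the small payload that would be shared,
# e.g. via dist.all_gather, in place of the full gradient tensor.
g = torch.randn(1024, 1024)
vals, idx, shape = compress_topk(g, ratio=0.01)
g_sparse = decompress_topk(vals, idx, shape)
```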
System Architecture & Scaling
- Architect full training pipelines (data ingestion ➜ deployment)
- Build containerized systems via Kubernetes/Docker across cloud platforms
- Design model sharding for 100B+ parameter models (see the FSDP sketch after this list)
- Implement CI/CD pipelines for distributed infrastructure
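For the model-sharding item, a minimal sketch using PyTorch FSDP with a size-based auto-wrap policy; MyTransformer and the 100M-parameter threshold are illustrative assumptions, and real 100B+ runs would layer tensor and pipeline parallelism on top of this.

```python
# Minimal sketch: shard a large model with FSDP so parameters, gradients,
# and optimizer state are split across ranks.
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

def shard_model(model: torch.nn.Module) -> FSDP:
    # Wrap any submodule above ~100M parameters as its own FSDP unit so only
    # one unit's full parameters are materialized on a GPU at a time.
    policy = functools.partial(size_based_auto_wrap_policy,
                               min_num_params=100_000_000)
    return FSDP(model,
                auto_wrap_policy=policy,
                device_id=torch.cuda.current_device())

# Usage (after dist.init_process_group("nccl") under torchrun):
#   sharded = shard_model(MyTransformer())   # MyTransformer is a placeholder
#   loss = sharded(batch).sum(); loss.backward(); optimizer.step()
```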
Performance Engineering
- Profile and optimize throughput, memory, communication patterns
- Leverage mixed-precision training, gradient accumulation, and fused kernels (a minimal loop sketch follows this list)
- Build benchmarking and performance regression suites
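Here is a minimal sketch combining mixed precision with gradient accumulation, two of the techniques above; the model, data loader, and accumulation factor are placeholders.

```python
# Minimal sketch: autocast + GradScaler with gradient accumulation to emulate
# a larger effective batch size without extra memory.
import torch

def train_amp(model, loader, optimizer, accum_steps=8):
    scaler = torch.cuda.amp.GradScaler()
    model.train()
    optimizer.zero_grad(set_to_none=True)

    for step, (x, y) in enumerate(loader):
        x, y = x.cuda(), y.cuda()
        # Run forward/backward in reduced precision where it is numerically safe.
        with torch.autocast(device_type="cuda"):
            loss = torch.nn.functional.cross_entropy(model(x), y) / accum_steps
        scaler.scale(loss).backward()

        # Step only every `accum_steps` micro-batches.
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
```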
Qualifications
- Bachelor’s or Master’s in CS, Engineering, or a related field
- 5+ years of experience with large‑scale distributed systems / HPC
- Deep expertise with distributed LLM frameworks (TorchTitan, Megatron-LM, DeepSpeed, FairScale)
- Expert‑level PyTorch knowledge and hands‑on with DDP/FSDP/RPC
- Strong experience in Python and C++/CUDA systems programming
- Familiarity with Kubernetes, Docker, and cloud platforms (AWS/GCP/Azure)
- Proven track record scaling ML training workloads efficiently
- Experience training models >10B parameters via model parallelism
- CUDA and GPU optimization for deep learning workloads
- Proficiency with NCCL, MPI, high‑throughput data pipelines
- Familiarity with decentralized systems, P2P, or blockchain technologies
- Contributions to open‑source ML/distributed training projects
What You’ll Work On
- Benchmark distributed frameworks for permissionless setups
- Build proof‑of‑concept infrastructure: gradient compression, node management, fault tolerance
- Develop performance benchmarking and observability tools
- Collaborate with research teams to transition algorithms to production
- Lead development of a next-generation distributed training platform supporting 1000+ participants
- Implement advanced CUDA/kernel optimizations
- Release SDKs/APIs for permissionless participation
- Establish performance benchmarks and open‑source infrastructure components
Your work will directly enable the democratization of LLM training, making large-scale models accessible across the globe regardless of access to centralized resources. You’ll be central to building the distributed systems that support permissionless AI innovation.