Research Engineer - Distributed Training

Job in Brazil, Clay County, Indiana, 47834, USA
Listing for: CloudWalk, Inc.
Full Time, Apprenticeship/Internship position
Listed on 2025-12-27
Job specializations:
  • IT/Tech
    AI Engineer, Data Scientist, Machine Learning/ ML Engineer, Cloud Computing
Salary/Wage Range or Industry Benchmark: 100000 - 140000 USD Yearly
Job Description & How to Apply Below

About CloudWalk

CloudWalk is building the intelligent infrastructure for the future of financial services. Powered by AI, blockchain, and thoughtful design, our systems serve millions of entrepreneurs across Brazil and the US every day.

Our AI team trains large-scale language models that power real products – from payment intelligence and credit scoring to on-device assistants for merchants.

About the Role

We’re looking for a Research Engineer to design, scale, and evolve CloudWalk’s distributed training stack for large language models. You’ll work at the intersection of research and infrastructure – running experiments across DeepSpeed, FSDP, Hugging Face Accelerate, and emerging frameworks like Unsloth, TorchTitan, and Axolotl.

You’ll own the full training lifecycle: from cluster orchestration and data streaming to throughput optimization and checkpointing. If you enjoy pushing the limits of GPUs, distributed systems, and next-generation training frameworks, this role is for you.

Responsibilities
  • Design, implement, and maintain CloudWalk’s distributed LLM training pipeline.
  • Orchestrate multi-node, multi-GPU runs across Kubernetes and internal clusters.
  • Optimize performance, memory, and cost across large training workloads.
  • Integrate cutting-edge frameworks (Unsloth, TorchTitan, Axolotl) into production workflows.
  • Build internal tools and templates that accelerate research-to-production transitions.
  • Collaborate with infra, research, and MLOps teams to ensure reliability and reproducibility.
Requirements
  • Strong background in PyTorch and distributed training (DeepSpeed, FSDP, Accelerate).
  • Hands‑on experience with large-scale multi‑GPU or multi‑node training.
  • Familiarity with Transformers, Datasets, and mixed‑precision techniques.
  • Understanding of GPUs, containers, and schedulers (Kubernetes, Slurm).
  • Mindset for reliability, performance, and clean engineering.
Bonus
  • Experience with Ray, MLflow, or W&B.
  • Knowledge of ZeRO, model parallelism, or pipeline parallelism.
  • Curiosity for emerging open‑source stacks like Unsloth, TorchTitan, and Axolotl.
Our process is simple: a deep conversation on distributed systems and LLM training, and a cultural interview.

Compensation

Competitive salary, equity, and the opportunity to shape the next generation of large‑scale AI infrastructure at CloudWalk.
