Machine Learning/Reinforcement Learning Infrastructure Engineer

Job in Cambridge, Middlesex County, Massachusetts, 02140, USA
Listing for: Eka Robotics
Full-time position
Listed on 2026-05-06
Job specializations:
  • Software Development
    Software Engineer, Cloud Engineer - Software
Salary/Wage Range or Industry Benchmark: 125,000 - 150,000 USD per year
Job Description
Eka Robotics

Eka Robotics is on a mission to build intelligence for the physical world: robots that are fast, general, and reliable. Our approach, grounded in physics, unlocks superhuman capabilities. We are defining the frontier of robotics research and deployment.

Our team consists of pioneers in robotics and machine learning. We are hiring to scale our R&D effort. We are looking for hands‑on individuals who are excited to help shape the future of robotics.

The Role

We are looking for a Machine Learning/Reinforcement Learning Infrastructure Engineer to shape our training infrastructure.

In this role, you will be responsible for designing, implementing, and maintaining the large‑scale model training systems that power our next generation of robot learning. You will focus on building an exceptional developer experience, creating intuitive and efficient tooling that our engineers and scientists love to use. Your work will accelerate our research cycles, making it effortless to test new ideas and scale successful experiments into production training runs.

You will work closely with researchers to ensure our infrastructure scales seamlessly from prototyping to large‑scale distributed training. This is a hands‑on, high‑impact role at the intersection of machine learning, software engineering, and scalable infrastructure.

Responsibilities
  • Own Training Infrastructure:
    Design, implement, and maintain robust systems for large‑scale model training, including job orchestration, scheduling, checkpointing, and experiment tracking.
  • Developer Experience & Tooling:
    Build streamlined, intuitive abstractions for launching, monitoring, debugging, and reproducing experiments, minimizing friction and maximizing productivity for our research teams.
  • Scale Distributed Training:
    Work closely with researchers to reliably scale reinforcement learning and machine learning pipelines across compute clusters.
  • Resource Management:
    Ensure efficient allocation and utilization of cloud‑based compute resources while building the foundational systems needed for future scaling.
  • Collaborate with Researchers:
    Partner with the research team to understand their needs, build infrastructure that supports cutting‑edge methods, guide best practices for training at scale, and contribute to core JAX model and training code.

Minimum Qualifications
  • Education:
    BS, MS, or higher in Computer Science, Computer Engineering, Machine Learning, or a related technical field.
  • Software Engineering:
    Strong software engineering fundamentals with a proven track record of building ML training infrastructure, internal developer platforms, or scalable systems.
  • Deep Learning Frameworks:
    Hands‑on experience with large‑scale training using JAX (preferred), PyTorch, or TensorFlow.
  • Distributed Systems:
    Familiarity with distributed training, multi‑host setups, data pipelines, and managing workloads on cloud platforms or orchestration systems (e.g., Kubernetes, SLURM, GCP, AWS).
  • Communication & Ownership:
    Strong cross‑functional communication skills, a deep ownership mindset, and a passion for building tools that improve the developer experience.
  • Infrastructure & DevOps:
    Experience building automated testing pipelines, CI/CD for ML workflows, and custom logging/telemetry stacks.

Preferred Qualifications
  • Domain Experience:
    Background in robotics, reinforcement learning, or other machine learning systems.
  • Systems Design:
    Experience designing abstractions that balance researcher flexibility with system reliability.