Senior Infrastructure Engineer

Job in Zürich, 8058, Zurich, Kanton Zürich, Switzerland
Listing for: Cleeven group
Full Time position
Listed on 2026-02-07
Job specializations:
  • IT/Tech
    Machine Learning/ML Engineer, Data Engineer, Systems Engineer, AI Engineer
Salary/Wage Range or Industry Benchmark: CHF 80,000 - 100,000 per year
Location: Zürich

Join a European applied machine learning team focused on building the next generation of large-scale training infrastructure for foundation models. You will contribute to the design and development of high-performance distributed systems that enable cutting-edge research in machine learning and reinforcement learning at scale.

The team’s mission is to create robust, efficient, and scalable training frameworks that accelerate experimentation and push the boundaries of model performance and system efficiency.

Key Responsibilities

As a senior member of the machine learning infrastructure team, you will work across the full training systems stack:

  • Design, develop, and scale distributed reinforcement learning training systems
  • Build high-performance RL pipelines supporting actor/learner architectures
  • Optimize large-scale training on accelerators (GPU/TPU), with a strong focus on JAX
  • Improve performance, reliability, reproducibility, and observability of training pipelines
  • Work on cluster-level orchestration, resource management, and large-scale execution
  • Collaborate closely with research and systems engineering teams to accelerate iteration
  • Identify and resolve performance bottlenecks across compute, memory, I/O, and compilation layers

Your work will directly impact the scalability, throughput, and stability of reinforcement learning experiments, enabling advances in agent reasoning, decision-making, and policy learning.

Profile

Required Qualifications
  • Master’s or PhD in Computer Science, Computer Engineering, or a closely related field
  • Proven experience designing, building, or maintaining large-scale machine learning training infrastructure
  • Strong proficiency in Python and experience with PyTorch and/or JAX
  • Hands-on experience running training workloads on GPU and/or TPU
  • Solid understanding of distributed systems concepts (parallelism, fault tolerance, synchronization)

Preferred Qualifications
  • Experience developing or optimizing reinforcement learning training loops or pipelines
  • Strong software engineering skills with a focus on performance, reliability, and debuggability
  • Deep understanding of PyTorch/JAX internals, XLA, and performance profiling on GPU/TPU
  • Expertise in distributed RL architectures (actor/learner, experience replay, parallel environments)
  • Experience designing training services, orchestration tools, or automated ML pipelines
  • Proven ability to diagnose performance bottlenecks in large-scale ML workloads
  • Experience with cloud-based clusters or specialized accelerators
  • Contributions to ML frameworks, distributed training libraries, or high-performance computing systems
  • Excellent communication and collaboration skills across research and engineering teams
Position Requirements
10+ years of work experience