Senior Infrastructure Engineer, Zürich area, Kanton Zürich, Switzerland (IT/Tech)

Location: Zürich

Join a European applied machine learning team focused on building the next generation of large-scale training infrastructure for foundation models. You will contribute to the design and development of high-performance distributed systems that enable cutting-edge research in machine learning and reinforcement learning at scale.

The team’s mission is to create robust, efficient, and scalable training frameworks that accelerate experimentation and push the boundaries of model performance and system efficiency.

Key Responsibilities

As a senior member of the machine learning infrastructure team, you will work across the full training systems stack:

Design, develop, and scale distributed reinforcement learning training systems
Build high-performance RL pipelines supporting actor/learner architectures
Optimize large-scale training on accelerators (GPU/TPU), with a strong focus on JAX
Improve performance, reliability, reproducibility, and observability of training pipelines
Work on cluster-level orchestration, resource management, and large-scale execution
Collaborate closely with research and systems engineering teams to accelerate iteration
Identify and resolve performance bottlenecks across compute, memory, I/O, and compilation layers

Your work will directly impact the scalability, throughput, and stability of reinforcement learning experiments, enabling advances in agent reasoning, decision-making, and policy learning.

Profile

Required Qualifications

Master’s or PhD in Computer Science, Computer Engineering, or a closely related field
Proven experience designing, building, or maintaining large-scale machine learning training infrastructure
Strong proficiency in Python and experience with PyTorch and/or JAX
Hands-on experience running training workloads on GPU and/or TPU
Solid understanding of distributed systems concepts (parallelism, fault tolerance, synchronization)

Preferred Qualifications

Experience developing or optimizing reinforcement learning training loops or pipelines
Strong software engineering skills with a focus on performance, reliability, and debuggability
Deep understanding of PyTorch/JAX internals, XLA, and performance profiling on GPU/TPU
Expertise in distributed RL architectures (actor/learner, experience replay, parallel environments)
Experience designing training services, orchestration tools, or automated ML pipelines
Proven ability to diagnose performance bottlenecks in large-scale ML workloads
Experience with cloud-based clusters or specialized accelerators
Contributions to ML frameworks, distributed training libraries, or high-performance computing systems
Excellent communication and collaboration skills across research and engineering teams
