Member of Technical Staff - Foundations
Job in Zurich, Blaine County, Montana, 59547, USA
Listed on 2026-06-02
Listing for: Tzafon
Full-time position
Job specializations:
- Software Development
- Machine Learning / ML Engineer, AI Engineer, Data Scientist, Data Engineer
Job Description
Tzafon is a foundation model lab building scalable compute systems and advancing machine intelligence, with offices in San Francisco, Zurich & Tel Aviv. We've raised over $12m in funding to advance our mission of expanding the frontiers of machine intelligence.
We're a team of engineers and scientists with deep backgrounds in ML infrastructure & research. Founded by IOI and IMO medalists, PhDs, and alumni of leading tech companies such as Google DeepMind, Character, and NVIDIA, we train models and build infrastructure for swarms of agents that automate work across real-world environments.
You'll work across our product and post-training teams to ship Large Action Models that actually work: building evals, benchmarks, and fine-tuning pipelines, and defining what good model behavior means and making it happen at scale.
What you'll do
- Design and execute large-scale training runs on our clusters
- Build and optimize distributed training infrastructure across massive multi-node systems
- Implement post-training pipelines at scale
- Develop data pipelines that process and filter trillions of tokens for pre-training
- Research and implement architectural improvements, scaling laws, and training optimizations
- Debug training instabilities, loss spikes, and convergence issues in long-running jobs
- Build tooling for cluster utilization, fault tolerance, and checkpoint management
- Write custom CUDA/Triton kernels to optimize critical training operations (attention, normalization, activations)
- Collaborate on research that advances the state of the art in foundation model training
What you'll bring
- Deep experience pre-training or post-training foundation models on large clusters
- Expert-level proficiency in Python and ML frameworks (PyTorch, JAX, TorchTitan)
- Strong systems skills: distributed training, FSDP/ZeRO, tensor parallelism, pipeline parallelism
- Experience writing performant CUDA or Triton kernels for ML workloads
- Track record of running stable multi-week training jobs and debugging distributed training failures
- Understanding of cluster scheduling, networking bottlenecks, and GPU/TPU performance optimization
Nice to have
- Trained foundation models at major AI labs (OpenAI, Anthropic, Google DeepMind, Meta, xAI, etc.)
- Worked on large-scale RL runs
- Optimized critical training kernels (FlashAttention, fused optimizers, custom kernels)
- Published research at top ML conferences (NeurIPS, ICML, ICLR)
- Contributions to open source ML infrastructure (PyTorch, JAX, vLLM, etc.)
- Experience with training data pipelines, data quality research, or synthetic data generation
What we offer
- Full medical, dental, and vision coverage, plus a 401(k) in the US
- Offices in SF, Zurich, and Tel Aviv
- Early-stage equity in a future-defining company
Starting compensation is $200k-$500k plus an equity package, depending on experience & location.
We also offer a $5k referral bonus for each referral who is successfully hired (send referrals to careers).