Machine Learning Systems Engineer
Berkeley, Alameda County, California, 94709, USA
Listed on 2025-12-04
Software Development
AI Engineer, Machine Learning / ML Engineer
Who We Are
At RelationalAI, we are building the future of intelligent data systems through our cloud-native relational knowledge graph management system—a platform designed for learning, reasoning, and prediction.
We are a remote-first, globally distributed team with colleagues across six continents. From day one, we’ve embraced asynchronous collaboration and flexible schedules, recognizing that innovation doesn’t follow a 9-to-5.
We are committed to an open, transparent, and inclusive workplace. We value the unique backgrounds of every team member and believe in fostering a culture of respect, curiosity, and innovation. We support each other’s growth and success—and take the well‑being of our colleagues seriously. We encourage everyone to find a healthy balance that affords them a productive, happy life, wherever they choose to live.
We bring together engineers who love building core infrastructure, obsess over developer experience, and want to make complex systems scalable, observable, and reliable.
Machine Learning Systems Engineer
Location: Remote (San Francisco Bay Area / North America / South America)
Experience Level: 3+ years of experience in machine learning engineering or research
About ScalarLM
This role will involve working heavily with the ScalarLM framework and team.
ScalarLM unifies vLLM, Megatron-LM, and Hugging Face for fast LLM training, inference, and self‑improving agents—all via an OpenAI‑compatible interface. ScalarLM builds on top of the vLLM inference engine, the Megatron‑LM training framework, and the Hugging Face model hub. It unifies the capabilities of these tools into a single platform, enabling users to easily perform LLM inference and training, and to build higher‑level applications such as agents with a twist: they can teach themselves new abilities via backpropagation.
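To illustrate what the OpenAI‑compatible interface looks like in practice, here is a minimal sketch using the openai Python client against a hypothetical local ScalarLM deployment; the base URL, model name, and API key are placeholder assumptions rather than details from this listing:

```python
# Minimal sketch: talking to a ScalarLM server through its OpenAI-compatible API.
# The base_url, api_key, and model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # hypothetical local ScalarLM endpoint
    api_key="unused-placeholder",         # local deployments may not require a real key
)

response = client.chat.completions.create(
    model="my-finetuned-model",  # placeholder model identifier
    messages=[{"role": "user", "content": "Explain what a knowledge graph is in one sentence."}],
)
print(response.choices[0].message.content)
```

Because the interface follows the OpenAI API, existing client code and agent frameworks can typically target a ScalarLM deployment by changing only the base URL and model name.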
ScalarLM is inspired by the work of Seymour Roger Cray (September 28, 1925 – October 5, 1996), an American electrical engineer and supercomputer architect who designed a series of computers that were the fastest in the world for decades, and founded Cray Research, which built many of these machines. Called "the father of supercomputing", Cray has been credited with creating the supercomputer industry.
It is a fully open source project (CC0-licensed) focused on democratizing access to cutting‑edge LLM infrastructure that combines training and inference in a unified platform, enabling the development of self‑improving AI agents similar to DeepSeek-R1.
ScalarLM is supported and maintained by TensorWave in addition to RelationalAI.
As a Machine Learning Engineer, you will contribute directly to our machine learning infrastructure and the ScalarLM open source codebase, and build large‑scale language model applications on top of it. You’ll operate at the intersection of high-performance computing, distributed systems, and cutting‑edge machine learning research, developing the fundamental infrastructure that enables researchers and organizations worldwide to train and deploy large language models at scale.
This is an opportunity to take on technically demanding projects, contribute to foundational systems, and help shape the next generation of intelligent computing.
You Will
- Contribute code and performance improvements to the open source project.
- Develop and optimize distributed training algorithms for large language models.
- Implement high‑performance inference engines and optimization techniques.
- Work on integration between vLLM, Megatron‑LM, and Hugging Face ecosystems.
- Build tools for seamless model training, fine‑tuning, and deployment.
- Optimize performance on advanced GPU architectures.
- Collaborate with the open source community on feature development and bug fixes.
- Research and implement new techniques for self‑improving AI agents.
- Programming Languages: Proficiency in both C/C++ and Python
- High Performance Computing: Deep understanding of HPC concepts, including:
  - MPI (Message Passing Interface) programming and optimization (see the sketch after this list)
  - Bulk Synchronous Parallel (BSP) computing models
  - Multi‑GPU and multi‑node distributed computing
  - CUDA/ROCm programming experience preferred
- Machine Learning…
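As a concrete illustration of the MPI and BSP items above (an independent sketch, not code from ScalarLM), the following mpi4py example runs one BSP-style superstep: each rank computes a local result, all ranks combine their results with an allreduce, and a barrier separates supersteps.

```python
# Illustrative BSP-style superstep with mpi4py (assumed installed); run with
# e.g. `mpirun -n 4 python bsp_sketch.py`. Placeholder data stands in for real work.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Superstep, phase 1: local computation (here, a fake per-rank gradient shard).
local_grad = np.full(4, float(rank), dtype=np.float64)

# Phase 2: communication -- element-wise sum of the shards across all ranks.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)

# Phase 3: synchronization barrier before the next superstep would begin.
comm.Barrier()

if rank == 0:
    print("summed gradient:", global_grad)
```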