ML Systems Engineer, Infrastructure & Cloud
Listed on 2026-01-11
IT/Tech
Systems Engineer, Data Engineer
About Basis
Basis is a nonprofit applied AI research organization with two mutually reinforcing goals.
The first is to understand and build intelligence. This means to establish the mathematical principles of what it means to reason, to learn, to make decisions, to understand, and to explain; and to construct software that implements these principles.
The second is to advance society’s ability to solve intractable problems. This means expanding the scale, complexity, and breadth of problems that we can solve today, and even more importantly, accelerating our ability to solve problems in the future.
To achieve these goals, we’re building both a new technological foundation that draws inspiration from how humans reason, and a new kind of collaborative organization that puts human values first.
About the Role
ML Systems Engineers at Basis ensure that training and evaluation infrastructure is fast, reliable, and scalable. You will own the full stack, from distributed training frameworks through cloud administration, making it possible for researchers to iterate quickly on complex models while managing computational resources efficiently.
We are looking for engineers who combine deep understanding of ML systems with operational excellence. The ideal ML Systems Engineer has experience with distributed training at scale, understands the intricacies of debugging numerical instabilities, and can manage cloud infrastructure that scales from experiments to production. You will be the guardian of training stability, the optimizer of compute costs, and the enabler of reproducible research.
This role spans traditional ML engineering and cloud/DevOps responsibilities. You will manage GPU clusters, optimize cloud spending, ensure security and compliance, and build the infrastructure that lets researchers focus on algorithms rather than operations.
We seek individuals who aspire to build robust ML infrastructure, maintain “logbook culture” for documenting issues and solutions, and treat operational excellence as a first-class concern.
We expect you to:
Have demonstrated expertise in ML systems engineering. Examples include:
Managing distributed training jobs across hundreds of GPUs
Debugging and fixing numerical instabilities in large-scale training
Building infrastructure for reproducible ML experiments
Optimizing training throughput and resource utilization
Possess deep knowledge of distributed training frameworks including PyTorch/JAX distributed strategies (DDP, FSDP, ZeRO), gradient accumulation, mixed precision training, and checkpoint/recovery systems.
Have strong cloud administration skills including AWS/GCP/Azure services, infrastructure as code (Terraform), Kubernetes orchestration, cost optimization, security best practices, and compliance requirements.
Understand the full ML stack from hardware (GPUs, interconnects, storage) through frameworks (PyTorch, JAX) to high-level training loops and evaluation pipelines.
Be skilled at debugging complex failures across the stack—GPU/NCCL issues, data loading bottlenecks, memory leaks, gradient explosions, and convergence problems.
Value documentation and knowledge sharing. You maintain comprehensive logs of issues encountered, solutions found, and lessons learned, building institutional knowledge.
Progress with autonomy while coordinating closely with researchers. You can anticipate infrastructure needs, prevent problems before they occur, and respond quickly when issues arise.
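To make one of the techniques above concrete: gradient accumulation lets a job reach a large effective batch size without holding the whole batch in GPU memory, by averaging the gradients of several micro-batches before a single optimizer step. A minimal sketch in plain Python for a 1-D least-squares model (illustrative only; real training loops would do this in PyTorch or JAX, but the accounting is identical):

```python
# Toy model y = w * x with loss = mean((w*x - y)^2).
# Shows that averaging per-micro-batch gradients reproduces
# the full-batch gradient when micro-batches are equal-sized.

def grad(w, xs, ys):
    """Mean-squared-error gradient dL/dw over one batch."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, micro_batches):
    """Average per-micro-batch gradients, as a training loop would
    accumulate them before calling the optimizer step once."""
    total = 0.0
    for xs, ys in micro_batches:
        total += grad(w, xs, ys)
    return total / len(micro_batches)

if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.0, 4.0, 6.0, 8.0]
    full = grad(0.5, xs, ys)
    micro = accumulated_grad(0.5, [(xs[:2], ys[:2]), (xs[2:], ys[2:])])
    print(full, micro)  # the two gradients agree
```

In a real DDP setup the same idea applies, with all-reduce of the accumulated gradients deferred until the final micro-batch.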
In addition, the following would be an advantage:
Experience at organizations training large models (OpenAI, Anthropic, Google, Meta).
Background in both ML research and production systems.
Contributions to ML frameworks or distributed training libraries.
Experience with on‑premise GPU cluster management.
Knowledge of optimization theory and numerical methods.
Understanding of robotics‑specific infrastructure requirements.
Own distributed training infrastructure including job launchers, checkpointing systems, recovery mechanisms, and monitoring that ensures experiments run reliably at scale.
Debug and resolve training failures by diagnosing issues across GPUs, networking, numerics, and data…
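The checkpointing and recovery responsibility above rests on one pattern worth illustrating: write to a temporary file, then atomically rename it over the target, so a job killed mid-write never leaves a corrupt "latest" checkpoint. A minimal sketch in plain Python (file names and the JSON format are illustrative assumptions, not a Basis convention):

```python
import json
import os
import tempfile

def save_checkpoint(state: dict, path: str) -> None:
    """Write the checkpoint to a temp file in the same directory,
    then atomically rename it over the target. Rename within one
    filesystem is atomic, so a reader sees either the old checkpoint
    or the new one -- never a partial write."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)
        raise

def load_checkpoint(path: str) -> dict:
    """Read back the most recently committed checkpoint."""
    with open(path) as f:
        return json.load(f)
```

Recovery then reduces to checking whether the checkpoint path exists at startup and resuming from its recorded step; real systems layer retention policies and remote storage on top of the same primitive.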