Member of Technical Staff - Decentralized High-Performance Computing Leader
Listed on 2026-01-01
Software Development
AI Engineer, Machine Learning / ML Engineer
Location: New York
Member of Technical Staff - Decentralized High-Performance Computing Leader
Software Engineer – AI Systems & Infrastructure
About The Role
We’re seeking a highly skilled Software Engineer to design and build the systems that power next‑generation AI infrastructure. In this role, you’ll architect and develop the software that keeps large‑scale machine learning workloads running efficiently, enabling researchers and engineers to push the boundaries of what’s possible with modern AI.
You’ll collaborate closely with internal teams and customers to craft robust, scalable solutions for distributed computing, data management, and model training. Each day brings new challenges — from optimizing GPU utilization to creating smarter orchestration tools for massive compute clusters.
What You’ll Do
- Design and enhance job scheduling systems to increase GPU efficiency and throughput for large‑scale machine learning workloads.
- Develop intuitive management interfaces and APIs that simplify cluster control and integration with frameworks like PyTorch, JAX, and TensorFlow.
- Build observability and monitoring systems to track performance, utilization, and progress across vast distributed training environments.
- Streamline data pipelines to accelerate both model training and inference processes, ensuring smooth and reliable data flow.
- Integrate deeply with ML tooling such as MLflow, Kubeflow, and Weights & Biases, developing seamless services and connectors that enhance developer productivity.
- Write high‑performance libraries and internal utilities to automate deployment, scaling, and the management of distributed training workloads.
You’re passionate about building the backbone of large‑scale AI systems. You thrive in dynamic environments, enjoy solving deep technical problems, and have a track record of turning complex requirements into elegant, reliable code. You value clarity, teamwork, and the satisfaction that comes from shipping tools that others depend on daily.
What We Value
- A customer‑focused mindset and the ability to turn user needs into thoughtful, scalable solutions.
- A drive to take initiative, act decisively, and deliver results without waiting for perfect conditions.
- Comfort working in ambiguous, fast‑evolving problem spaces with shifting priorities.
- Excellent communication skills and a collaborative approach that uplifts teammates and partners alike.
- Developed or optimized systems for training or serving large‑scale ML models, ideally across 1,000+ GPUs.
- Improved performance and efficiency of distributed training workflows spanning multiple nodes and accelerators.
- Built APIs, SDKs, or interfaces that simplify machine learning operations and enhance developer experience.
- Experience with cluster orchestration technologies such as Kubernetes or SLURM in the context of large‑scale ML workloads.
- Contributed to or worked with ML infrastructure tools such as Ray, Horovod, or DeepSpeed, and with workflow systems like MLflow, Kubeflow, or Weights & Biases.
AI development is only as powerful as the infrastructure behind it. This position offers the opportunity to shape the systems that drive some of the world’s most advanced machine learning workloads. You’ll help design the tools, frameworks, and services that define how AI at scale is trained, deployed, and managed — with real impact on the industry’s evolution.
About Andiamo
Talent Partners for the AI Revolution. As a globally recognized staffing and consulting firm, we specialize in placing the top 2% of technology and go‑to‑market professionals with the world’s largest and most well‑known companies. For over 20 years, we’ve maintained the status of tier‑one vendor for firms such as Palantir, Amazon, Fluidstack, Bloomberg, Relativity Space, Firefly, Mastercard, Visa, Two Sigma, Citadel, and other major financial services firms, elite hedge funds, Google‑backed tech start‑ups, and major software firms.
Our talent solutions include Permanent Placement, Contract Staffing, Executive Search, and Dedicated Recruiting Services (RPO). Find out more at