ML Infrastructure Engineer — ML Platform, Tooling & Systems
Listed on 2025-11-03
IT/Tech
AI Engineer, Machine Learning/ ML Engineer, Systems Engineer, Cloud Computing
Overview
Field AI is transforming how robots interact with the real world. We are building risk-aware, reliable, and field-ready AI systems that solve the hardest challenges in autonomy and unlock the full potential of embodied intelligence. Our solutions go beyond conventional data-driven ML or purely transformer-based models, and they are already deployed globally. We are building real-world AI that learns from experience and delivers tangible, continuous improvements in the field.
Are you excited by the challenge of supporting ML teams with robust, scalable infrastructure? Do you want to help accelerate real-time robotics through better developer workflows and reliable systems?
Field AI is hiring an ML Infrastructure Engineer to own the software platform and tooling that enables fast, reliable AI development and deployment across our ML and robotics stacks.
What You Will Get To Do
- Build ML Infrastructure & Developer Tooling
- Design and implement internal tools, libraries, and CLI utilities that streamline experimentation, model training, and evaluation.
- Improve local and cloud development environments using Docker, internal package registries, and monorepos.
- Build reusable templates and interfaces for training, evaluation, and inference pipelines.
- Support the ML Lifecycle (Data → Models → Deployment)
- Develop pipelines for dataset ingestion, transformation, versioning, and validation.
- Automate model training, evaluation, packaging, and deployment to cloud and edge environments.
- Ensure integrity and traceability across data, code, and model artifacts.
- Improve Build Systems and Developer Experience
- Maintain and evolve a shared monorepo across ML, robotics, and software teams.
- Leverage Bazel or similar systems to enable fast, reproducible builds and tests.
- Enhance developer workflows to support consistent environments and reduce friction.
- Own CI/CD and Automation for ML Systems
- Build and maintain CI/CD pipelines (e.g., GitHub Actions, AWS Step Functions) for ML experimentation and deployment.
- Automate regression testing and benchmarking of models.
- Develop observability tools: dashboards, telemetry systems, and model health monitoring.
- Collaborate Across Engineering & Research Teams
- Work closely with ML scientists, software engineers, and roboticists to translate high-level platform needs into robust engineering solutions.
- Participate in code and design reviews, documentation, and cross-team planning.
What You Have
- 3+ years of industry experience in software engineering, infrastructure, MLOps, or DevOps roles.
- Deep familiarity with the ML lifecycle, including data preparation, model training, packaging, and deployment.
- Strong software engineering foundations: proficiency with Git, Python, and system design.
- Experience building and managing containerized environments (e.g., Docker) and working with orchestration tools (e.g., Kubernetes).
- Hands-on experience with CI/CD workflows and infrastructure-as-code (e.g., Terraform, AWS CDK).
- Experience with cloud ML platforms (AWS, GCP, or Azure).
- A strong product mindset — building internal tools with empathy for researchers and engineers.
- Experience with distributed training frameworks (e.g., PyTorch DDP, FSDP, DeepSpeed, Megatron).
- Familiarity with orchestrating large-scale training jobs using Kubernetes-based platforms (e.g., Ray, SageMaker, EKS, Karpenter).
- Background in hybrid edge-cloud ML deployments or infrastructure supporting robotic systems.
- Prior work in environments requiring real-time ML performance, safety validation, or regulatory traceability.
Our salary range is $70,000 - $300,000 annually, but we take into consideration an individual's background…