Software Engineer, ML Infrastructure

Job in San Francisco, San Francisco County, California, 94199, USA
Listing for: Attentive Mobile, Inc.
Full Time position
Listed on 2026-06-18
Job specializations:
  • IT/Tech
    Machine Learning/ ML Engineer, AI Engineer (Applied/Software), Cloud Computing: Infrastructure & Operations
Salary/Wage Range or Industry Benchmark: 200,000 - 250,000 USD yearly
Job Description & How to Apply Below
Position: Staff Software Engineer, ML Infrastructure

Who we are

We’re looking for a self-motivated, highly driven Staff Software Engineer to join our Machine Learning Operations (MLOps) team. As a team, we enable Attentive’s Machine Learning (ML) practice to directly impact Attentive’s AI product suite by providing the tools to train, deploy, and run inference on ML models with higher velocity and performance while maintaining reliability. We build and maintain a foundational ML platform that spans the full ML lifecycle and is used by ML engineers and data scientists.

This is an exciting opportunity to join a rapidly growing MLOps team at the ground floor with the ability to drive and influence the architectural roadmap enabling the entire ML organization at Attentive.

This team and role are responsible for building and operating Attentive’s ML compute and orchestration architecture, which currently consists of a hosted notebook solution with Spark on AWS EMR, a multi-cluster CPU and GPU-enabled training and inference orchestrator leveraging Metaflow on Argo Workflows, and an ML feature store. We are excited to bring on more engineers to continue expanding this stack.

Why Attentive needs you
  • Define and lead cross-functional ML infrastructure and ML platform projects
  • Demonstrate the ability to analyze, troubleshoot, coordinate, and resolve complex ML infrastructure issues
  • Orchestrate Kubernetes and ML training / inference infrastructure exposed as an ML platform
  • Expose and manage environments, interfaces, and workflows to enable ML engineers to develop, build, and test ML models and services
  • Manage and expand our feature store implementation that allows ML teams to self-service data labeling, feature engineering, and batch inferencing
  • Close the latency gap on model inference to online, real-time model serving
  • Develop automation workflows to improve team efficiency and ML stability
  • Analyze and improve efficiency, scalability, and stability of various system resources
  • Partner with other teams and business stakeholders to deliver business initiatives
  • Help onboard new team members, provide mentorship, and enable a successful ramp-up on your team's codebases
About you
  • You have worked in MLOps / ML Platform / Data Platform / Site Reliability Engineering / DevOps / Infrastructure for 8+ years and understand DevOps best practices as applied to ML
  • You have successfully led major cross-functional, cross-team ML infrastructure or ML platform projects
  • Your passion is infrastructure and exposing platform capabilities through interfaces that enable high-performance ML practices, rather than designing ML experiments (this team does not directly develop ML models)
  • You have deep experience in Kubernetes applied to ML use cases such as CPU & GPU training, hosting and exposing ML tools, and managing ML endpoints as web services
  • You understand the key differences between online and offline ML inference and can articulate what is critical to succeeding with each
  • You have a background in software development and are passionate about bringing that experience to bear on the world of ML infrastructure
  • You have experience with Infrastructure as Code using Terraform and can’t imagine a world without it
  • You understand the importance of CI/CD in building high-performing teams and have worked with tools like Jenkins, CircleCI, Argo Workflows, and ArgoCD
  • You are passionate about observability and have worked with tools such as Splunk, Nagios, Sensu, Datadog, and New Relic
  • You are very familiar with containers and container orchestration and have direct experience with vanilla Docker as well as Kubernetes, as both a user and an administrator
Sample Projects
  • Design and lead implementation of an online inference pipeline with champion/challenger model testing
  • Unite existing pipelines across data, ML, and platform teams to handle low-latency, high-volume real-time streaming use cases in production inference workflows
  • Define golden path build and release pipelines for better reliability and Python package management
  • Identify opportunities to improve the scalability, resiliency, and cost efficiency of GPU training and inference workflows
  • Design and lead implementation of a…