
Senior Machine Learning Engineer, DevOps/SRE

Job in San Jose, Santa Clara County, California, 95199, USA
Listing for: Roku
Full Time position
Listed on 2026-06-19
Job specializations:
  • IT/Tech
    Machine Learning/ ML Engineer, SRE/Site Reliability, Cloud Computing: Infrastructure & Operations
Salary/Wage Range or Industry Benchmark: 148,750 - 361,000 USD Yearly
Job Description & How to Apply Below

Teamwork makes the stream work.
Roku is changing how the world watches TV

Roku is the #1 TV streaming platform in the U.S., Canada, and Mexico, and we've set our sights on powering every television in the world. Roku pioneered streaming to the TV. Our mission is to be the TV streaming platform that connects the entire TV ecosystem. We connect consumers to the content they love, enable content publishers to build and monetize large audiences, and provide advertisers unique capabilities to engage consumers.

From your first day at Roku, you'll make a valuable - and valued - contribution. We're a fast-growing public company where no one is a bystander. We offer you the opportunity to delight millions of TV streamers around the world while gaining meaningful experience across a variety of disciplines.

About the team

The Advertising Performance group focuses on performance for all participants in the Advertising ecosystem: Advertisers, Publishers, and Roku. The systems and solutions span multiple disciplines and technologies to perform real-time multi-objective optimization across distributed systems at large scale and with low latency. We use Machine Learning, Reinforcement Learning, AI, Control and Optimization Systems, and Auction Dynamics to solve a large set of complex problems. At the core of this is our Machine Learning, Experimentation, and Inference Platform that powers the entire landscape, which we continuously evolve over time.

About the role

We are seeking a talented and experienced Senior Software Engineer, MLOps/DevOps, to join the Advertising Performance team and play a critical role in supporting and scaling our Machine Learning infrastructure. The ideal candidate has a strong background in DevOps/SRE practices, cloud infrastructure management, and MLOps tooling, with a passion for building platforms that accelerate ML experimentation and deployment at internet scale.

You will partner closely with ML Scientists and Engineers to streamline the end-to-end ML lifecycle across training, evaluation, deployment, and monitoring — on top of a modern, cloud-native stack running on GCP and AWS using Kubernetes, Apache Airflow, Spark, Ray, MLflow, Chronon, etc.

For California Only: The estimated annual salary for this position is between $148,750 and $361,000. Compensation packages are based on factors unique to each candidate, including but not limited to skill set, certifications, and specific geographical location. This role is eligible for health insurance, equity awards, life insurance, disability benefits, parental leave, wellness benefits, and paid time off.

What you’ll be doing
  • Lead the design and operation of scalable, production-grade cloud infrastructure for ML workloads across AWS and GCP, including GPU/TPU-based training and inference environments
  • Architect and improve CI/CD systems for ML models and platform services to enable fast, reliable, and safe production releases
  • Own and evolve low-latency infrastructure for real-time model inference, including KV store and vector databases
  • Define and enforce observability standards for ML systems, including model performance monitoring, drift detection, capacity planning, and pipeline health metrics
  • Participate in on-call rotation, leading incident response and root-cause analysis for critical ML training and serving infrastructure
  • Partner with data scientists and ML engineers to improve platform usability, accelerate model iteration, and implement strong MLOps and SRE best practices
  • Champion operational excellence across ML infrastructure through automation, resilience engineering, disaster recovery planning, and continuous improvement
We’re excited if you have
  • BS or MS in Computer Science, Engineering, or a related quantitative field
  • 8+ years of experience in DevOps, SRE, or ML infrastructure, including 4+ years supporting large-scale ML or AI systems
  • Strong programming skills in Python, Scala, and/or Java for platform automation and tooling
  • Deep experience with Kubernetes and container orchestration on GCP (GKE) and/or AWS (EKS)
  • Expertise with NoSQL or low-latency data stores such as Aerospike or similar technologies
  • Hands‑on experience…
Position Requirements
10+ Years work experience