Machine Learning Engineer
Listed on 2026-02-16
IT/Tech
Systems Engineer, Machine Learning / ML Engineer
The Role
We’re looking for an ML Ops Engineer to own the infrastructure and systems that move machine learning models from research into reliable, observable, production-grade clinical workflows.
This role sits at the intersection of deep learning systems, infrastructure, and production engineering. You will partner closely with research, backend, and product teams to ensure models are deployable, scalable, measurable, and correct in real-world environments.
This is a hands-on role with ownership across training pipelines, inference systems, monitoring, and iteration loops.
Responsibilities
- Deploy, operate, and optimize GPU-based inference systems for low-latency, high-throughput workloads.
- Own model serving infrastructure, including batching, caching, and runtime optimization.
- Implement and maintain APIs for real-time model inference.
- Design and maintain CI/CD pipelines for model training, testing, validation, and rollout.
- Build reproducible experimentation frameworks for training, tuning, and deployment cycles.
- Manage distributed training and inference infrastructure, including GPU scheduling and scaling.
- Profile and benchmark models in production, identifying bottlenecks in latency, memory, and throughput.
- Design observability systems to track model performance, drift, failures, and uptime.
- Use production signals to drive iteration decisions and system-level improvements.
- Partner with research teams to transition models from research to production systems.
- Collaborate with product engineers and clinicians to meet real-world workflow constraints.
- Make clear, defensible tradeoffs between model quality, system cost, and operational reliability.
Requirements
- 4+ years of experience in ML Ops, infrastructure, or distributed systems.
- Strong hands-on experience deploying and operating GPU-based inference systems.
- Deep familiarity with PyTorch, including performance tuning and debugging.
- Proven ability to own systems end-to-end and operate independently in ambiguous environments.
- Experience optimizing LLM or deep learning inference (batching, caching, memory efficiency).
- Comfort reasoning about distributed systems tradeoffs (compute, communication, scaling).
- Clear ownership of production systems, not just research exposure.
- Familiarity with DICOM, HL7, or healthcare data standards.
- Experience working in regulated or safety-critical ML environments.
- Experience with Docker, Kubernetes, and cloud environments (AWS or GCP).
We hire for clarity, ownership, and judgment.
The ideal engineer:
- Thinks in systems. Sees beyond individual tasks to how everything connects.
- Executes with precision. Moves quickly without sacrificing long-term quality.
- Owns outcomes. Takes responsibility across design, build, and delivery.
- Builds with purpose. Writes code that improves lives, not just benchmarks.
You’ll work directly with leading engineers, clinicians, and researchers from UC Berkeley and UCSF, building products that didn’t exist before. If you want to shape how AI enters the clinic, and you care about craft as much as impact, this is your team.