Machine Learning Engineer
Listed on 2026-02-16
IT/Tech
Systems Engineer, Machine Learning / ML Engineer
The Role
We’re looking for an ML Ops Engineer to own the infrastructure and systems that move machine learning models from research into reliable, observable, production-grade clinical workflows.
This role sits at the intersection of deep learning systems, infrastructure, and production engineering. You will partner closely with research, backend, and product teams to ensure models are deployable, scalable, measurable, and correct in real-world environments.
This is a hands-on role with ownership across training pipelines, inference systems, monitoring, and iteration loops.
Responsibilities
- Deploy, operate, and optimize GPU-based inference systems for low-latency, high-throughput workloads.
- Own model serving infrastructure, including batching, caching, and runtime optimization.
- Implement and maintain APIs for real-time model inference.
- Design and maintain CI/CD pipelines for model training, testing, validation, and rollout.
- Build reproducible experimentation frameworks for training, tuning, and deployment cycles.
- Manage distributed training and inference infrastructure, including GPU scheduling and scaling.
- Profile and benchmark models in production, identifying bottlenecks in latency, memory, and throughput.
- Design observability systems to track model performance, drift, failures, and uptime.
- Use production signals to drive iteration decisions and system-level improvements.
- Partner with research teams to transition models from research to production systems.
- Collaborate with product engineers and clinicians to meet real-world workflow constraints.
- Make clear, defensible tradeoffs between model quality, system cost, and operational reliability.
Requirements
- 4+ years of experience in ML Ops, infrastructure, or distributed systems.
- Strong hands-on experience deploying and operating GPU-based inference systems.
- Deep familiarity with PyTorch, including performance tuning and debugging.
- Proven ability to own systems end-to-end and operate independently in ambiguous environments.
- Experience optimizing LLM or deep learning inference (batching, caching, memory efficiency).
- Comfort reasoning about distributed systems tradeoffs (compute, communication, scaling).
- Clear ownership of production systems, not just research exposure.
- Familiarity with DICOM, HL7, or healthcare data standards.
- Experience working in regulated or safety-critical ML environments.
- Experience with Docker, Kubernetes, and cloud environments (AWS or GCP).
We hire for clarity, ownership, and judgment.
The ideal engineer:
- Thinks in systems. Sees beyond individual tasks to how everything connects.
- Executes with precision. Moves quickly without sacrificing long-term quality.
- Owns outcomes. Takes responsibility across design, build, and delivery.
- Builds with purpose. Writes code that improves lives, not just benchmarks.
You’ll work directly with leading engineers, clinicians, and researchers from UC Berkeley and UCSF, building products that didn’t exist before. If you want to shape how AI enters the clinic, and you care about craft as much as impact, this is your team.