
Senior Machine Learning Engineer MLOps/LLMOps, Redwood City, CA

Job in Redwood City, San Mateo County, California, 94061, USA
Listing for: Sumo Logic
Full Time position
Listed on 2026-02-17
Job specializations:
  • IT/Tech
    AI Engineer, Machine Learning/ ML Engineer, Cloud Computing, Data Scientist
Salary/Wage Range or Industry Benchmark: 80000 - 100000 USD Yearly
Job Description & How to Apply Below
Position: Senior Machine Learning Engineer - I (MLOps/LLMOps)

Redwood City, CA

As a Senior Machine Learning Engineer - MLOps/LLMOps, you will design, build, and scale production-grade infrastructure and platforms that enable the full lifecycle of ML and LLM systems. You'll architect robust pipelines for model training, evaluation, deployment, and monitoring while ensuring reliability, observability, and efficiency. This role collaborates closely with ML Engineers, Data Scientists, and Product teams to operationalize AI/ML solutions from prototype to production.

Responsibilities
  • Platform & Infrastructure: Design and implement scalable MLOps/LLMOps platforms supporting the full ML lifecycle: data versioning, model training, evaluation, deployment, and monitoring.
  • Build and maintain CI/CD pipelines for ML models and LLM applications with automated testing, validation, and rollback capabilities.
  • Develop infrastructure-as-code (IaC) for reproducible, version-controlled ML environments.
  • Architect model serving infrastructure with auto-scaling, A/B testing, and canary deployment capabilities.
  • LLM Operations: Build platforms for LLM fine-tuning, prompt management, and experimentation at scale.
  • Implement evaluation frameworks for LLM performance, quality, safety, and cost optimization.
  • Design and deploy enterprise-grade AI agents and copilots with robust monitoring and guardrails.
  • Establish LLM observability: token usage tracking, latency monitoring, prompt/response logging, and cost attribution.
  • Operational Excellence: Own uptime, reliability, and performance of ML/LLM services (SLIs/SLOs).
  • Implement comprehensive monitoring, alerting, and incident response for ML systems.
  • Participate in on‑call rotations and drive post‑incident reviews to improve system resilience.
  • Build automation and tooling to reduce toil and accelerate ML development velocity.
  • Partner with ML Engineers and Data Scientists to translate research into production‑ready systems.
  • Collaborate with platform and infrastructure teams on cloud architecture and resource optimization.
  • Mentor team members on MLOps best practices, production ML patterns, and operational excellence.
  • Drive technical decisions on tooling, frameworks, and architectural patterns.
Required Qualifications and Skills
  • Education: B.S./M.S./Ph.D. in Computer Science, Engineering, or related technical field.
  • Experience: 4+ years of software engineering experience with 2+ years focused on MLOps/LLMOps.
  • MLOps Expertise:
    • Production experience with ML model serving frameworks (e.g., TensorFlow Serving, TorchServe, Triton).
    • Hands‑on with ML experiment tracking and model registry tools (MLflow, Weights & Biases, Kubeflow).
    • Proficiency in workflow orchestration (Airflow, Prefect, Kubeflow Pipelines, Metaflow).
  • LLMOps Expertise:
    • Experience with LLM deployment, fine‑tuning, and evaluation frameworks (e.g., vLLM, LangChain, LlamaIndex).
    • Knowledge of prompt engineering, RAG architectures, and LLM application patterns.
    • Familiarity with LLM observability tools (e.g., LangSmith, Arize, WhyLabs).
  • Strong experience with major cloud providers (AWS, GCP, or Azure) and ML‑specific services (SageMaker, Vertex AI, Azure ML, Bedrock).
  • Proficiency in containerization (Docker, Kubernetes) and infrastructure‑as‑code (Terraform, CloudFormation, Pulumi).
  • Experience with microservices architecture and API development (REST, gRPC).
  • Strong programming skills in Python, Terraform, and Helm; familiarity with Go, Java, or Rust is a plus.
  • Deep understanding of CI/CD practices and tools (GitHub Actions, GitLab CI, Jenkins, ArgoCD).
  • Experience with monitoring and observability stacks (Prometheus, Grafana, Datadog, ELK).
  • Track record of managing production systems with defined SLIs/SLOs.
  • Experience with on‑call rotations, incident management, and reliability engineering practices.
Desired Qualifications and Skills
  • Experience building internal ML platforms or developer tooling used by multiple teams.
  • Hands‑on with distributed training frameworks (Ray, Horovod, DeepSpeed).
  • Knowledge of model optimization techniques (quantization, distillation, pruning).
  • Familiarity with feature stores (Feast, Tecton) and data versioning tools (DVC, lakeFS).
  • Understanding of ML security best practices, model governance,…
Position Requirements
10+ Years work experience