DevOps Engineer Job, San Francisco area, California, USA, IT/Tech

About Menlo

Menlo Research is an Applied R&D lab building Asimov, an open-source humanoid robot platform, and the full software stack that powers it. Our mission is to make humanoid labor economically viable – turning software into physical labor. We build across the full stack: hardware architecture, locomotion, autonomy, simulation, and infrastructure. We move fast, ship to real robots, and open-source everything we can.

If you want your work to matter beyond a paper or a demo, this is the place.

The Role

As a DevOps Engineer, you will own and evolve the platform that everything at Menlo runs on – from inference serving, to training rigs, to the agentic coding infrastructure that powers day-to-day engineering. You will work deep in the stack across Kubernetes, networking, and, where it matters, bare metal, and help set the technical direction for how Menlo Cloud scales.

What You’ll Do

Operate and evolve our Kubernetes platform across multiple clusters and environments (Prod, Dev, hybrid on‑prem and public cloud), covering control plane operations, node lifecycle, upgrades, and autoscaling at every layer (Cluster Autoscaler, HPA, KEDA).
Architect and manage hybrid cloud infrastructure spanning on‑premises and public clouds (GCP, AWS), including workload placement, cross‑cloud networking, and unified resource management.
Own the CI/CD and GitOps experience end-to-end: container build pipelines, image optimization, and progressive delivery via ArgoCD / FluxCD.
Own the observability stack as a single pane of glass across all clusters: Grafana, Mimir, Tempo, Loki, Pyroscope, OnCall, Prometheus – and help push toward agent‑assisted SRE workflows.
Manage and improve our inference platform: vLLM serving and AIBrix for multi‑model orchestration and autoscaling across a fleet of NVIDIA GPUs.
Operate platform services: Kafka, Redis, PostgreSQL, OpenSearch.
Manage identity and access via Keycloak integrated with Google Workspace; harden SSO, RBAC, and secrets management across the platform.
Harden network security across private load balancers, firewalls, and VPC segmentation; design and maintain hub‑and‑spoke / multi‑AZ topologies.
Support training infrastructure: self‑service VM provisioning, RunPod burst capacity, Weights & Biases integration.
Drive infrastructure reliability, cost efficiency, and capacity planning as the platform scales.

What We’re Looking For

Kubernetes – deep, hands‑on. Strong production experience with Kubernetes, fluent in workloads and controllers, networking (Services, Ingress, CNI basics), storage (PV/PVC, CSI), RBAC, and the autoscaling story end‑to‑end (HPA, VPA, Cluster Autoscaler, KEDA). Cloud‑managed Kubernetes (GKE, EKS, AKS) is fine; on‑premises / self‑managed Kubernetes (kubeadm, Cluster API, k3s, etc.) is a strong plus.
Networking – design‑level, not just operator‑level. You have designed real network topologies at some point in your career – hub‑and‑spoke, multi‑AZ / multi‑VPC, or an equivalent enterprise pattern – and can defend the tradeoffs. Comfortable with VPCs, firewalls, load balancers, private cluster architecture, DNS, and routing. On‑premises networking experience (VLANs, BGP, L2/L3 fabrics, pfSense / Fortinet / Palo Alto / Cisco) is a strong plus.
CI/CD and Docker – concepts over tooling. You can build and optimize Dockerfiles (multi‑stage builds, layer caching, small/secure base images) and have owned full CI/CD pipelines end‑to‑end. Tooling is flexible – GitHub Actions, GitLab CI, Azure Pipelines, Jenkins, Argo Workflows, etc. – but you should be able to clearly articulate the full lifecycle of a typical pipeline, and explain how CI/CD changes when the deployment target is Kubernetes (ArgoCD / FluxCD, GitOps patterns, progressive delivery).
Observability – you have built this before. You have stood up a full observability stack from scratch and operated it in production – metrics, logs, traces, alerting, on‑call. Familiarity with the Grafana stack (Grafana, Mimir, Tempo, Loki, Pyroscope, OnCall, Prometheus) is a strong plus. Bonus points if you have experimented with agent‑assisted SRE workflows or LLM‑driven incident triage.
SSO and identity. When you bring a…