MLOps Engineer Job, Kent area, Washington, USA, IT/Tech

About the Role

Our client's Vision AI platform runs where the data is generated — on-premises, inside government facilities, and at the network edge — not in a hyperscaler cloud. That means the infrastructure has to be bulletproof: GPU clusters provisioned correctly, Kubernetes workloads scheduled efficiently across heterogeneous compute, storage performing at the throughput AI training and inference demands, and the network capable of handling high‑bandwidth, low‑latency sensor data at scale.

As an MLOps / AI Infrastructure Engineer, you will own all of it. You will rack, configure, and operate the on‑premises compute and GPU infrastructure that powers the platform, build and maintain the Kubernetes clusters that orchestrate AI workloads, design the networking fabric that ties edge nodes to core compute, and implement the MLOps pipelines that take models from development to production.

You will work directly with our AI/ML engineers, the Lead Architect, and on‑site client technical teams to ensure the platform runs reliably in environments that are often air‑gapped, physically secured, and subject to strict government compliance requirements.

Key Responsibilities

GPU Compute & Hardware Infrastructure

Deploy, configure, and maintain on‑premises GPU servers — primarily NVIDIA H200 and A100 nodes — including driver management, CUDA toolkit versioning, NVLink/NVSwitch topology, and firmware updates.
Implement and tune NVIDIA‑specific tooling: DCGM (Data Center GPU Manager) for health monitoring and telemetry, MIG (Multi‑Instance GPU) partitioning for multi‑tenant workloads, and NVIDIA Container Toolkit for GPU‑aware containerization.
Manage bare‑metal provisioning workflows (iPXE, PXE, or tools such as MAAS/Foreman) to enable repeatable, auditable server builds at client sites.
Monitor hardware health, capacity utilization, and thermal/power envelopes; define alerting thresholds and respond to hardware failures with minimal service disruption.

Kubernetes & Container Orchestration

Build, upgrade, and maintain production‑grade Kubernetes clusters (kubeadm or Rancher RKE2) on bare‑metal infrastructure, with GPU node pools configured via the NVIDIA GPU Operator.
Design and operate cluster networking using CNI plugins appropriate for high‑throughput AI workloads — Calico, Cilium, or SR‑IOV for RDMA‑capable networking where required.
Configure and manage MetalLB or equivalent bare‑metal load balancing, ingress controllers, and service mesh components (Istio or Linkerd) for secure intra‑cluster communication.
Implement ResourceQuotas, LimitRanges, PriorityClasses, and node affinity/taints to ensure AI training jobs, inference services, and platform workloads coexist without resource contention.
Maintain cluster security posture: RBAC policies, Pod Security Admission, network policies, secrets management (HashiCorp Vault or Sealed Secrets), and CIS Kubernetes Benchmark compliance.

MLOps Pipelines & AI Workload Management

Deploy and operate MLOps platforms (MLflow, Kubeflow, or equivalent) for experiment tracking, model versioning, and pipeline orchestration across training and inference workloads.
Configure and manage NVIDIA Triton Inference Server for multi‑model serving, dynamic batching, and model ensemble execution on GPU nodes.
Build CI/CD pipelines for model deployment (GitOps with Argo CD or Flux), including automated model validation, canary rollouts, and rollback mechanisms.
Optimize GPU utilization for both batch training jobs (Volcano or Kueue scheduler) and latency‑sensitive inference services, tracking efficiency metrics via DCGM and Prometheus.
Manage model artifact storage and versioning using software‑defined storage backends (Ceph RBD/CephFS or MinIO) integrated with the MLOps toolchain.

Networking & Storage Architecture

Design and implement the high‑bandwidth network fabric required for GPU cluster interconnects (InfiniBand, RoCE v2, or high‑speed Ethernet) and ensure RDMA is correctly configured for distributed training workloads.
Deploy and operate software‑defined storage solutions (Ceph or equivalent) providing block, object, and file storage tiers for training datasets, model checkpoints, and…