Senior Platform Engineer - ML Infrastructure Job, Winnipeg area, Manitoba, Canada, IT/Tech

Position: Senior Platform Engineer - ML Infrastructure

Job Overview

Join Team CARFAX as a Senior Platform Engineer – ML Infrastructure. We are looking for a seasoned Senior Platform Engineer to join our platform team and take an active role in designing, scaling, and operating the infrastructure that powers Large Language Model (LLM) development and hosting.

This is a high‑impact, highly technical position where you will own critical platform components, drive architectural decisions, and directly shape the reliability, performance, and security of our AI infrastructure. At its core, this is a Kubernetes‑first, cloud‑native platform engineering role. We care deeply about your ability to architect and operate scalable, resilient infrastructure for LLM workloads; the specific cloud or tooling background is secondary.

Our current platform runs on AWS with EKS, Flyte, ArgoCD, JupyterHub, and the LGTM observability stack (Loki, Grafana, Tempo, Mimir), and you'll work within that environment.

We are looking for an engineer who thrives at the intersection of AI/ML and cloud‑native infrastructure, who gets excited about solving the unique scaling and operational challenges that LLM workloads demand, and who wants to work on technology that sits at the absolute cutting edge of the AI industry.

The position requires 2 days in the London, ON office per week. Our four‑day week continues in Summer 2026.

What You’ll Own

LLM Platform Architecture – Actively participate in the design and evolution of the core infrastructure platform supporting LLM training, fine‑tuning, and inference workloads at scale.
Kubernetes & Advanced Autoscaling – Own the design and implementation of sophisticated K8s autoscaling strategies (HPA, VPA, KEDA, Cluster Autoscaler) tailored to the highly variable and GPU‑intensive demands of LLM workloads.
ML Workflow Orchestration – Participate in the engineering and optimization of ML pipeline infrastructure, contributing to best practices for pipeline design, resource allocation, and workflow reliability across LLM training and evaluation workloads.
AI Developer Platform – Own and contribute to the architecture and operations of interactive compute environments used by AI researchers and LLM engineers to develop, experiment, and prototype.
CI/CD & GitOps – Participate in the development and ongoing improvement of GitOps workflows and CI/CD pipelines, contributing to deployment best practices and enabling rapid, reliable delivery of platform changes.
Observability & Reliability – Contribute to the full observability stack implementation – designing dashboards, defining SLOs, building alerting frameworks, and ensuring deep visibility into LLM workload performance and platform health.
Cloud Infrastructure – Participate in cloud infrastructure design across compute, storage, networking, and IAM, with a strong emphasis on cost optimization and operational excellence.
Security & Compliance – Engage actively in the vulnerability assessment and remediation program across all platform components, contributing to security standards and ensuring the LLM platform meets organizational and regulatory compliance requirements.
Collaborative Engineering – Participate in technical design reviews, contribute to roadmap discussions, and serve as a knowledgeable resource and collaborative partner across AIOps and MLOps disciplines.

Required Experience & Skills

7+ years of experience in DevOps, Platform Engineering, MLOps, or a closely related infrastructure discipline.
Deep Kubernetes expertise – production experience operating Kubernetes at scale on any major managed platform (EKS, GKE, AKS) or on‑premises, with advanced knowledge of scheduling, autoscaling, networking, RBAC, and cluster operations.
Cloud infrastructure proficiency – extensive experience designing and operating production workloads on at least one major cloud provider (AWS, GCP, or Azure), covering compute, storage, networking, and identity and access management.
MLOps / AI Infrastructure experience – demonstrated experience building and operating infrastructure that supports ML training, model serving, or LLM workloads, including GPU resource management and scheduling at scale.
CI/CD & GitOps – strong hands‑on experience with Git…