Senior ML Platform Engineer (AI Farm)

Job in Toronto, Ontario, M5A, Canada
Listing for: 0000050007 Royal Bank of Canada
Full Time position
Listed on 2026-06-18
Job specializations:
  • IT/Tech
    Data Engineering, AI Engineer (Applied/Software)
Position: Senior ML Platform Engineer (AI Farm)

Job Description

What's the opportunity?

We're looking for a Senior ML Platform Engineer to join the team behind AI Farm, RBC's enterprise GPU compute and data platform for machine learning. You'll own and deliver critical platform capabilities that enable hundreds of ML researchers and engineers to train models, access data, and deploy […]. This isn't a typical MLOps role. You'll be building the platform itself: the Kubernetes infrastructure, data access layer, compliance automation, and developer tooling that our ML teams depend on daily.

You'll work at the intersection of distributed systems, data engineering, and platform engineering, solving problems like multi-tenant GPU scheduling, data governance enforcement, and self-serve infrastructure provisioning.

At RBC Borealis, you'll join a small, high-impact team that operates AI Farm, an on-premises OpenShift + Run:AI cluster with H100, B300, and A100 GPUs serving multiple business units. You'll have direct ownership over system design decisions and ship features that immediately impact researcher productivity.

Your responsibilities include:
  • Designing and building Kubernetes-native automation for platform operations: PV lifecycle management, namespace provisioning, compliance scanning, and workload enforcement

  • Owning the data infrastructure layer: Trino/Starburst cluster operations, column-level data masking, resource group management, and catalog provisioning automation

  • Building developer-facing tools and libraries (Python SDK, CLI) that reduce cognitive load for ML teams accessing data and compute

  • Implementing data governance and compliance systems: automated scanning, classification integration, retention enforcement, and audit reporting

  • Designing and operating observability pipelines: Grafana dashboards for GPU utilization, developer experience metrics, pipeline throughput measurement, and compliance coverage

  • Collaborating with INFRA, security, and compliance teams to design and enforce platform policies (OPA admission webhooks, image enforcement, access controls)

  • Contributing to architecture decisions (ADRs) and owning end-to-end delivery of multi-sprint epics with cross-team dependencies

You're our ideal candidate if you have:

Must Have:
  • 5+ years of industry experience in software/platform engineering

  • Deep hands-on experience with Kubernetes in production (pod security, RBAC, storage classes, CronJobs, admission webhooks, custom controllers). OpenShift experience is a strong plus.

  • Proficiency in Python for building production tools, automation scripts, CLIs, and libraries

  • Experience operating distributed data systems (Trino/Presto/Spark, SQL engines, Iceberg/Hive catalogs, or similar)

  • Strong CI/CD and automation skills (GitHub Actions, Helm, GitOps, infrastructure-as-code)

  • Experience building multi-tenant platforms with self-serve provisioning for internal teams

  • Ability to own and deliver complex, ambiguous projects end-to-end with minimal direction

Strong Preference:
  • Experience with data governance, compliance automation, or security enforcement on shared platforms

  • Hands-on Prometheus/Grafana: building dashboards, alerting, and instrumentation from scratch

  • Container image lifecycle management (registries, scanning, enforcement policies)

  • Experience with GPU compute platforms (Run:AI, Slurm, or cloud GPU scheduling)

  • Familiarity with S3-compatible object storage and persistent volume management

Nice to Have:
  • Experience with Trino/Starburst (resource groups, connectors, column masking, SEP)

  • OPA/Gatekeeper policy-as-code experience

  • Familiarity with ML workflows (training jobs, experiment tracking, model serving) — enough to empathize with platform users

  • Experience in regulated industries (financial services, healthcare) with compliance requirements

  • Strong fundamentals in networking, storage, and distributed systems

What's in it for you?
  • Own significant platform capabilities on a small team with high autonomy and direct business impact

  • Work with cutting-edge GPU hardware (NVIDIA B300, H100, A100) powering real ML research

  • Collaborate with high-performing engineers and AI researchers solving problems in finance

  • A comprehensive Total Rewards Program including bonuses and flexible…

Position Requirements
  • 10+ years of work experience