AI Platforms Leader Enterprise AI Platforms

Job in San Diego, San Diego County, California, 92189, USA
Listing for: Nutanix
Full Time position
Listed on 2026-06-22
Job specializations:
  • IT/Tech
    AI Engineer (Applied/Software), SRE/Site Reliability, Cloud Computing: Infrastructure & Operations
Salary/Wage Range or Industry Benchmark: 200,000 - 250,000 USD yearly
Job Description & How to Apply Below
Company: Qualcomm Incorporated

Job Area: Engineering Group, Engineering Group > Software Engineering General

Summary:

We’re seeking an experienced AI Platforms Leader to own the strategy, architecture, and operation of our end‑to‑end AI Platform—spanning on‑prem GPU clusters and cloud services (AWS/GCP/Azure). You’ll lead a high-caliber engineering team to deliver reliable, secure, and cost‑efficient infrastructure for training, fine‑tuning, inference, retrieval, and agentic orchestration (including A2A patterns and MCP servers). If you love turning complex AI/ML requirements into robust, self‑service platform capabilities for builders across the company, this is your role.

This role requires full-time onsite work in San Diego, CA (5 days per week).

Key Responsibilities

Own the AI Platform strategy & roadmap

Define the multi‑year vision for a multi‑tenant, hybrid (on‑prem + cloud) AI platform, aligned to business needs, developer productivity, and cost efficiency.

Establish clear platform SLAs/SLOs, reliability goals, and security/compliance guardrails.

Run GPU-based compute at scale

Operate and optimize on‑prem GPU clusters (e.g., Kubernetes + GPU operator and/or Slurm), including capacity planning, scheduling, partitioning, NCCL, and high‑throughput storage/networking.

Drive GPU utilization efficiency, right‑sizing, and cost transparency across training and inference workloads.

Deliver MLOps & LLMOps as a product

Provide golden paths for data prep, training/fine‑tuning, model registry, lineage, governance, evaluation, red‑teaming, and safe deployment (batch, online, streaming).

Implement CI/CD for models, prompts, and agents; automate evaluations and rollout/rollback with canaries, A/B, and shadow deployments.

Agentic AI, A2A, and MCP ecosystem

Lead the design and operation of agentic orchestration (A2A patterns), tool integration, and MCP (Model Context Protocol) servers to securely expose enterprise tools and data.

Standardize agent capability schemas, guardrails, observability, and policy enforcement.

Cloud AI/ML platforms

Leverage AWS/Azure AI services for training and inference (e.g., Bedrock/SageMaker/EKS; Azure AI Studio/Azure ML/AKS/Azure OpenAI) with robust networking, identity, secrets, and cost controls.

Establish multi‑cloud patterns for portability, resilience, and vendor risk management.

Platform engineering & DevOps excellence

Own core platform services: identity/RBAC, secrets, service meshes, observability (logs/metrics/traces), data access controls, vector stores, feature stores, and model gateways (e.g., KServe/Triton/vLLM).

Use GitOps/IaC (Terraform/Bicep/Helm) and secure software supply chain practices (SBOMs, image signing, policy as code).

Operational leadership

Lead a global team of ~10 engineers (platform, SRE, MLOps/LLMOps), maintaining 24×7 readiness and a healthy on‑call rotation.

Drive incident response, post‑mortems, and continuous improvement. Partner with Security, Legal, and Compliance for model/data governance.

Stakeholder & vendor management

Partner with product, data, and application teams to enable high‑impact AI use cases.

Manage strategic vendors (e.g., cloud, GPU, enterprise AI tooling) and negotiate licenses/SOWs aligned to roadmap and budget.

Required Qualifications
15+ years overall engineering/technology experience, including ~10 years building and operating large‑scale platforms (AI/ML, data, or high‑performance computing).

Leadership:
Proven experience leading a team of ~10 engineers for 5+ years, across platform/SRE/MLOps/LLMOps, with coaching, hiring, performance management, and clear execution rhythms.

GPU cluster expertise:
Hands‑on operations for on‑prem GPU clusters (Kubernetes + GPU operator and/or Slurm), scheduling, capacity planning, performance tuning, and reliability.

MLOps & LLMOps:
Strong experience with model lifecycle (data → training → registry → deployment), model/agent evaluation, safety/guardrails, and observability.

Cloud (AWS/GCP/Azure):
Deep experience with AI/ML services and managed Kubernetes (EKS/AKS/GKE), networking, security, identity, and cost management.

DevOps/Platform Engineering: CI/CD, GitOps, IaC (Terraform/Bicep/Helm),…