AI Platforms Leader, Enterprise AI Platforms Job, San Diego area, California, USA, IT/Tech

## Company:

Qualcomm Incorporated

## Job Area:

Engineering Group, Engineering Group > Software Engineering

General Summary:

We're seeking an experienced AI Platforms Leader to own the strategy, architecture, and operation of our end‑to‑end AI Platform, spanning on‑prem GPU clusters and cloud services (AWS/GCP/Azure). You'll lead a high-caliber engineering team to deliver reliable, secure, and cost‑efficient infrastructure for training, fine‑tuning, inference, retrieval, and agentic orchestration (including A2A patterns and MCP servers). If you love turning complex AI/ML requirements into robust, self‑service platform capabilities for builders across the company, this is your role.

This role requires full-time onsite work in San Diego, CA (5 days per week).

Key Responsibilities

* Own the AI Platform strategy & roadmap

  * Define the multi‑year vision for a multi‑tenant, hybrid (on‑prem + cloud) AI platform, aligned to business needs, developer productivity, and cost efficiency.

  * Establish clear platform SLAs/SLOs, reliability goals, and security/compliance guardrails.

* Run GPU-based compute at scale

  * Operate and optimize on‑prem GPU clusters (e.g., Kubernetes + GPU operator and/or Slurm), including capacity planning, scheduling, partitioning, NCCL, and high‑throughput storage/networking.

  * Drive GPU utilization efficiency, right‑sizing, and cost transparency across training and inference workloads.

* Deliver MLOps & LLMOps as a product

  * Provide golden paths for data prep, training/fine‑tuning, model registry, lineage, governance, evaluation, red‑teaming, and safe deployment (batch, online, streaming).

  * Implement CI/CD for models, prompts, and agents; automate evaluations and rollout/rollback with canaries, A/B, and shadow deployments.

* Agentic AI, A2A, and MCP ecosystem

  * Lead the design and operation of agentic orchestration (A2A patterns), tool integration, and MCP (Model Context Protocol) servers to securely expose enterprise tools and data.

  * Standardize agent capability schemas, guardrails, observability, and policy enforcement.

* Cloud AI/ML platforms

  * Leverage AWS/Azure AI services for training and inference (e.g., Bedrock/SageMaker/EKS; Azure AI Studio/Azure ML/AKS/Azure OpenAI) with robust networking, identity, secrets, and cost controls.

  * Establish multi‑cloud patterns for portability, resilience, and vendor risk management.

* Platform engineering & DevOps excellence

  * Own core platform services: identity/RBAC, secrets, service meshes, observability (logs/metrics/traces), data access controls, vector stores, feature stores, and model gateways (e.g., KServe/Triton/vLLM).

  * Use GitOps/IaC (Terraform/Bicep/Helm) and secure software supply chain practices (SBOMs, image signing, policy as code).

* Operational leadership

  * Lead a ~10‑engineer global team (platform, SRE, MLOps/LLMOps) with 24×7 readiness and a healthy on‑call rotation.

  * Drive incident response, post‑mortems, and continuous improvement. Partner with Security, Legal, and Compliance for model/data governance.

* Stakeholder & vendor management

  * Partner with product, data, and application teams to enable high‑impact AI use cases.

  * Manage strategic vendors (e.g., cloud, GPU, enterprise AI tooling) and negotiate licenses/SOWs aligned to roadmap and budget.

Required Qualifications

* 15+ years overall engineering/technology experience, including ~10 years building and operating large‑scale platforms (AI/ML, data, or high‑performance computing).

* Leadership: Proven experience leading a team of ~10 engineers for 5+ years, across platform/SRE/MLOps/LLMOps, with coaching, hiring, performance management, and clear execution rhythms.

* GPU cluster expertise: Hands‑on operations for on‑prem GPU clusters (Kubernetes + GPU operator and/or Slurm), scheduling, capacity planning, performance tuning, and reliability.

* MLOps & LLMOps: Strong experience with model lifecycle (data → training → registry → deployment), model/agent evaluation, safety/guardrails, and observability.

* Cloud (AWS/GCP/Azure): Deep experience with AI/ML services and managed Kubernetes (EKS/AKS/GKE), networking, security, identity, and cost management.

* DevOps/Platform…