Principal TPM - AI Infrastructure Job, Dover area, Delaware, USA, IT/Tech

**Job Description**

The AI Infrastructure GPU Operations Team drives deployment planning, execution governance, operational readiness, reliability, and business rhythm for OCI's rapidly expanding GPU infrastructure portfolio. As Principal Technical Program Manager, you will lead cross-functional programs that connect engineering, platform, operations, business, finance, observability, SRE, network, and leadership teams across complex GPU operations initiatives.

You will own operating mechanisms for regional deployment readiness, GPU fleet health, milestone tracking, executive reporting, incident and change governance, risk management, and operational handoff across multiple concurrent GPU operations programs. This role requires strong program discipline, business analytics capability, and the ability to turn ambiguous technical and operational inputs into clear priorities, metrics, decisions, and action plans.

You will also improve the way the organization scales by strengthening dashboards, telemetry, documentation, onboarding, playbooks, repeatable processes, and the practical use of AI to improve operations productivity. The ideal candidate brings crisp communication, strong ownership, and pragmatic simplification to high-visibility GPU operations programs where disciplined execution, customer impact, and measurable reliability outcomes matter.

You are a structured, data-driven program leader who values simplicity, scalability, reliability, and clear operational mechanisms. You thrive in collaborative environments, communicate crisply with senior stakeholders, and drive consistent execution through ownership, metrics, and disciplined follow-through. You combine strategic clarity with enough technical and operational depth to help teams deliver reliable OCI AI Infrastructure GPU Operations while continuously improving the processes, telemetry, and automation that support it.

Travel: as needed for cross-site coordination, stakeholder alignment, and partner engagements.

**Responsibilities**

Key Responsibilities

GPU Fleet Operations & Reliability

+ Drive availability and reliability of large-scale GPU fleets, identifying systemic issues and leading cross-functional recovery efforts.

+ Support operational readiness and performance of distributed AI training and inference workloads across multi-region GPU clusters.

+ Lead GPU fleet health reviews across current and next-generation hardware, including NVIDIA H200, B200, and GB200/GB300 platforms, AMD Instinct MI300X, MI325X, MI350X, and MI355X, and related platforms.

Program Leadership & Execution

+ Own end-to-end execution of critical AI Infrastructure GPU Operations programs, ensuring alignment with business priorities, customer needs, and operational risk signals.

+ Set and run weekly operating cadences and governance forums across multiple concurrent initiatives, ensuring clear ownership, timelines, dependencies, decision points, and committed actions.

+ Coordinate cross-functional delivery across engineering, platform, operations, business operations, finance, observability, SRE, network, and senior leadership stakeholders.

Incident, Change & Deployment Governance

+ Manage deployment governance, change review, readiness tracking, stakeholder handoff, and operational execution processes.

+ Establish and scale structured incident management mechanisms, improving root cause analysis, corrective and preventive actions, and follow-through on durable fixes.

+ Serve as a primary escalation point between engineering and operations teams, resolving priority conflicts and accelerating issue resolution.

+ Lead Change Review Board processes for high-volume change activity, minimizing change-related incidents and protecting service quality.

Business Planning, Metrics & Executive Reporting

+ Build, model, and maintain business planning inputs, financial forecasts, analytical views, and operating reports for AI Infrastructure GPU Operations programs.

+ Own executive-level reporting, including monthly business reviews, weekly operational KPIs, critical project updates, risks, dependencies, decisions, and mitigation plans.

+ Provide data-driven insights into infrastructure…