Principal Software Engineer, CoreAI Engines
Job in Redmond, King County, Washington, 98053, USA
Listed on 2026-05-21
Listing for: Microsoft Corporation
Full Time position
Job specializations:
- IT/Tech
- AI Engineer (Applied/Software), Machine Learning/ML Engineer
Overview
The CoreAI Workloads team builds the foundational inference engines and APIs that power large-scale AI inference across Azure, from cutting-edge startups to Fortune 500 enterprises and Microsoft Copilots and agents. Our mission is to deliver secure, reliable, and highly efficient GPU inference that enables multitenant AI systems at global scale while maximizing utilization, performance, and developer productivity. We own inference serving and performance for OpenAI and other state-of-the-art large language models (LLMs), and we work directly with OpenAI, serving some of the largest workloads on the planet with trillions of inferences per day.
Our converged AI fabric and engines deliver inference capabilities for all LLMs in the Microsoft catalog, including OpenAI, Anthropic, Mistral, Cohere, Llama, and more.
This role sits at the intersection of LLM inference fleets, serving efficiency, rapid experimentation, cloud infrastructure, and systems software, working closely with CoreAI data plane, compute, and partner teams to deliver end-to-end efficiencies and platform capabilities.
In this role, you will have the opportunity to work on multiple levels of the AI software stack, including the fundamental abstractions, programming models, OpenAI and OSS engine runtimes, libraries, and application programming interfaces (APIs) that enable large-scale model inference.
You will drive production-grade inference serving improvements for OpenAI and open-source models across Azure, including benchmarking, performance measurement, and disciplined experimentation to improve latency, throughput, availability, and cost. You will both (1) make hands-on engine changes and (2) contribute to the experimentation capabilities that make those changes measurable, safe to ship, and repeatable across teams.
Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
In alignment with our Microsoft values, we are committed to cultivating an inclusive work environment for all employees to positively impact our culture every day.
Responsibilities
As the Principal Engineer on the team, your responsibilities include:
* Optimize inference engines for OpenAI and open-source models by implementing and shipping performance/efficiency improvements across runtime, scheduling, and serving paths (latency, throughput, utilization, availability, and cost).
* Run experiments end-to-end: formulate hypotheses, implement engine changes (including Python/PyTorch integration points where relevant), analyze results, and ship improvements behind guardrails.
* Build and use experimentation capabilities for large-scale AI inference (experiment lifecycle, tracking, metric modeling, comparability standards, automated analysis) so the team can iterate quickly and safely.
* Own serving availability and efficiency for Azure OpenAI Service workloads through tiered experimentation, lean segmentation, and multi-modal utilization across heterogeneous fleets, turning findings into shipped engine improvements.
* Design and evolve inference serving architectures to improve utilization and latency using techniques such as disaggregated serving, multi-token prediction, KV offload/retrieval, and quantization, validated via staged rollouts and production guardrails.
* Extend AI infrastructure abstractions to support elastic, heterogeneous inference engines reliably at scale (e.g., dynamic scaling across model families, modalities, and workload classes while maintaining isolation and SLOs).
* Tune and scale inference engines across NVIDIA GPU generations (A100, H100, H200) for state-of-the-art OpenAI models, focusing on serving efficiency, utilization, and reliability (not hardware bring-up).
* Partner with networking and storage teams to leverage high-performance interconnects (e.g., RDMA/InfiniBand-class fabrics such as RoCE over IB) for distributed…