Principal, Software Engineer
Listed on 2026-05-22
Software Development
AI Engineer (Applied/Software), Software Engineer
1345 Crossman Ave
Sunnyvale, CA
Walmart processes more transactions in a day than most companies handle in a year. When performance degrades or systems fail, the impact is immediate — measured in millions of dollars and hundreds of millions of customers. We’re building the team that prevents that using agentic AI.
As a Principal Engineer in Performance and Resiliency Engineering, you’ll architect and lead the development of intelligent, self‑healing systems: LLM‑based agents that detect anomalies, reason across observability data, and trigger automated remediation — without waiting for a human in the loop. You’ll operate at a scale most AI engineers never encounter: 10,500 stores, 240M weekly customers, and infrastructure that powers one of the world’s largest retail ecosystems.
This isn’t a research role or a proof‑of‑concept environment. You’ll own the technical strategy, set architectural direction, and ship to production — building agentic systems that directly impact Walmart’s global reliability and business continuity.
About the Team
Building the right technology foundation for Infrastructure & Platforms is vital to success at Walmart’s scale. Our team builds and maintains the foundational technologies that power the entire tech organization: data platforms, enterprise architecture, DevOps, cloud computing, and infrastructure. We ship to production weekly, run blameless postmortems, and treat chaos experiments as first‑class engineering work. If you thrive in high‑ownership environments where your architectural decisions have immediate, measurable impact, this is where you belong.
What you'll do
What You'll Own
You’ll set the technical direction — not just execute it. From initial architecture through production deployment, you’ll own the roadmap for Walmart’s agentic AI platform for performance and resiliency. You’ll have the autonomy to make architectural tradeoffs, drive experimentation, and shape how intelligent systems operate at enterprise scale.
Key Responsibilities
Build & Lead Agentic AI Systems
- Architect production multi‑agent pipelines, from RAG‑based knowledge grounding to LLM‑driven decision‑making and autonomous remediation, operating across 10,500 stores and 240M weekly customers
- Own LLM evaluation standards for production: factuality, consistency, safety guardrails, and failure modes; set the bar that other teams adopt
- Optimize LLM inference at scale through prompt caching, quantization, and retrieval filtering — measurable latency and cost impact, not theoretical gains
- Integrate vector databases and observability stacks to build context‑aware systems that act on live signals without human intervention
- Build the AI/ML layer that moves Walmart from reactive incident response to predictive, self‑correcting infrastructure — cutting mean time to recovery across critical systems
- Design and run chaos experiments that expose real failure modes and change architecture decisions — not checkbox exercises
- Define SLOs that reflect real business impact, integrate performance gates into CI/CD, and make observability (Grafana, Prometheus, ELK, Splunk) actionable across the org
- Write and maintain runbooks that teams actually use: tested, updated after every incident, and clear enough to act on under pressure
- Set the architectural direction for the org’s agentic AI platform — from initial design through production deployment — and own the decisions that follow
- Close the gap between experimentation and production: move ML models from notebooks into reliable, monitored systems that hold up under Black Friday‑scale traffic
- Raise the technical floor through design reviews and mentoring that produces engineers who make better decisions independently
- Shape the multi‑year roadmap for AI‑powered performance and resiliency, influencing infrastructure investment decisions across the org
- 10+ years of experience building and operating distributed systems at scale
- Proven, hands‑on production experience with LLMs, agentic frameworks, or RAG‑based systems
- Deep background in performance engineering, chaos…