
Distinguished, Software Engineer - AI/ML Engineer - Agentic Systems

Job in Sunnyvale, Santa Clara County, California, 94089, USA
Listing for: Wal-Mart
Full Time position
Listed on 2026-06-23
Job specializations:
  • IT/Tech
    AI Engineer (Applied/Software), Systems Engineer, Cloud Computing: Infrastructure & Operations, Machine Learning/ML Engineer
Job Description & How to Apply Below
Position Summary...

What you'll do...

As a Distinguished AI/ML Engineer within Walmart Global Tech's Reliability Engineering Organization, you will lead the technical development of next-generation agentic AI systems and intelligent automation solutions that ensure mission-critical reliability, scalability, and operational excellence across Walmart's entire technology ecosystem. You will architect and implement cutting-edge machine learning platforms and autonomous agents that transform how we manage change and performance and how we monitor, predict, and automatically resolve issues across all Walmart systems, supporting millions of associates and customers globally.

Walmart Global Tech's Reliability Engineering Organization is built with hybrid systems and software engineers who take technical ownership for change engineering, change management, performance engineering, reliability, scalability, automation, and mission-critical issues related to uptime, availability, and rapid continuous improvement across Walmart's e-commerce, stores, and omni-channel platforms. As a technical expert in this domain, you will drive the evolution of practices into AI-powered, self-healing, and autonomous systems built on modern technology stacks with intelligent change management and predictive performance optimization.

You will also define and implement unified, intelligent, and operationally robust technical solutions and tools for Walmart Technology organizations across all channels and geographies.

About the Team

The Reliability Engineering Organization at Walmart Global Tech is responsible for ensuring the reliability, availability, and performance of all systems that power the world's largest retailer. As a Fortune #1 company, our work impacts hundreds of millions of customers and associates globally, across every transaction, search, and interaction spanning Walmart's digital and physical ecosystem. We are the guardians of system reliability for Walmart's e-commerce platform, supply chain systems, in-store technology, financial services, and all critical business operations.

Our Reliability Engineering organization is at the forefront of applying advanced AI/ML technologies to reliability challenges, building autonomous systems that can predict, prevent, and resolve issues before they impact customers or business operations. Reliability Engineering is a core engineering discipline within Walmart Global Tech, working closely with all product and engineering teams across the enterprise to ensure every system meets the highest standards of reliability, scalability, and performance.

We are deeply invested in building a robust, intelligent, and highly automated technology foundation that supports Walmart's mission to help people live better through innovation and operational excellence.

What You'll Do

AI/ML & Agentic Systems Technical Leadership

* Architect and develop advanced agentic AI systems that autonomously manage complex reliability engineering workflows, predictive failure analysis, and self-optimization across Walmart's technology ecosystem.

* Design and implement multi-agent orchestration platforms that coordinate autonomous agents for change management, capacity planning, and performance optimization across e-commerce, supply chain, and in-store systems.

* Build intelligent observability and monitoring platforms using ML-driven anomaly detection, predictive analytics, and autonomous resolution across Walmart's entire technology landscape.

* Develop self-healing infrastructure platforms that leverage AI to predict, prevent, and automatically remediate system issues before they impact customers, associates, or business operations.

Reliability Engineering Technical Excellence

* Design, write, and build advanced tools to improve latency, availability, scalability, and change management across Walmart Technology systems, including:
  * Engineering reliability using metrics and measurements across all domains
  * Enabling system scaling through technical solutions, automation, and process optimization
  * Building tools and automation to prevent recurrence of failures across mission-critical services
  * Enhancing instrumentation to create a cohesive, end-to-end view of system health with particular focus on failure points

* Architect and implement fault-tolerant systems and services across Walmart's hybrid cloud infrastructure with emphasis on autonomous recovery and intelligent failure prediction.

* Collaborate with engineering teams and leadership to reduce mean time to detect (MTTD) and mean time to restore (MTTR) through intelligent automation and predictive capabilities.

* Partner with service owners across e-commerce, supply chain, stores, fintech, and other domains to define detection for SLA breaches and change-related anomalies, ensuring systems meet SLAs while maintaining optimal performance and user experience.

* Perform complex troubleshooting and analysis of large-scale distributed systems using deep expertise in coding, algorithms, and…