
Principal, Site Reliability Engineer

Job in Bentonville, Benton County, Arkansas, 72712, USA
Listing for: Walmart
Full Time position
Listed on 2026-06-18
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing: Infrastructure & Operations, SRE/Site Reliability
Salary/Wage Range or Industry Benchmark: 150,000 - 200,000 USD yearly

Position Summary

The Principal, Site Reliability Engineer leads the design, development, and implementation of reliability programs for complex site environments. This role ensures system performance, scalability, and disaster recovery through advanced monitoring, root cause analysis, and infrastructure automation. The position requires expertise in software architecture, distributed systems, and cloud technologies to optimize operational efficiency and resilience. The Principal Engineer collaborates across teams to drive continuous improvement, establish reliability standards, and support business objectives by delivering robust, scalable, and secure solutions aligned with organizational goals.

About the team

The CES team delivers exceptional customer service experiences to millions of Walmart customers and agents worldwide. Comprising software engineers, data scientists, and machine learning experts, the team advances GenAI technology within complex enterprise applications. As part of Walmart Global Tech’s Enterprise Business Systems, CES collaborates closely with product, business, and UX teams to drive measurable business outcomes. The team focuses on innovation, reliability, and scalability to support Walmart’s mission of helping customers save money and live better through cutting‑edge technology and robust site reliability engineering practices.

What you'll do
  • Design and develop reliability programs tailored to complex site environments, ensuring alignment with business goals and site safety engineering.
  • Lead and facilitate reliability testing and chaos experiments to validate application resiliency and system performance.
  • Analyze system architecture and performance to optimize scalability, disaster recovery, and operational efficiency.
  • Develop and implement monitoring strategies, establishing metrics and alerts to maintain system availability and reliability.
  • Guide root‑cause analysis efforts to identify and resolve defects, enhancing application stability and preventing incidents.
  • Drive infrastructure automation and telemetry integration to support continuous delivery and operational excellence.
  • Mentor team members on tools, coding standards, and reliability best practices.
What you'll bring
  • Extensive experience in site reliability engineering with strong expertise in system monitoring, root cause analysis, and reliability analysis.
  • Proficiency in designing scalable, modular, and extensible software architectures aligned with business and technical requirements.
  • In‑depth knowledge of disaster recovery planning, execution, and contingency procedures for complex site environments.
  • Skilled in cloud computing platforms and containerization technologies such as Docker.
  • Ability to lead reliability testing and chaos engineering experiments using open‑source tools.
  • Strong coding skills in languages like JavaScript and Python, with automation experience in CI/CD pipelines.
  • Proven capability to analyze system performance and implement telemetry for continuous improvement.
Minimum Qualifications
  • Bachelor’s degree in computer science, computer engineering, computer information systems, software engineering, or related area AND 5 years’ experience in site reliability engineering, site and system administration, infrastructure management, or related area.
  • Or 7 years’ experience in site reliability engineering, site and system administration, infrastructure management, or related area.
Preferred Qualifications
  • Experience in site reliability engineering, site and system administration, infrastructure management, or related area.
  • Master’s degree in site reliability engineering or related area AND 3 years’ experience in site reliability engineering, site and system administration, infrastructure management, or related area.
  • SRE certification (for example, IBM Cloud Site Reliability Engineer).
  • Knowledge of accessibility best practices and ability to create inclusive digital experiences in accordance with Walmart’s accessibility standards and guidelines.
Primary Location

2501 SE J St, Ste A, Bentonville, AR, United States of America

Benefits you’ll enjoy

Health benefits include medical, vision, and dental coverage.

Financial benefits include 401(k), stock purchase and company‑paid life insurance.

Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting.

Other benefits include short‑term and long‑term disability, company discounts, Military Leave Pay, and adoption and surrogacy expense reimbursement.

Live Better U: 100% covered tuition, books and other costs for accredited programs.

Competitive pay along with performance‑based bonus awards.

Drug‑free workplace policy: a no‑tolerance stance toward illegal drugs and alcohol on the job.

Bentonville, Arkansas US-10735:
The annual salary range for this position is $ - $.
