Azure Platform Reliability Engineer IV
Listed on 2026-02-18
IT/Tech
Cloud Computing, Systems Engineer, SRE/Site Reliability, IT Support
Introduction
Ahold Delhaize USA, a division of global food retailer Ahold Delhaize, is part of the U.S. family of brands, which also includes five leading omnichannel grocery brands – Food Lion, Giant Food, The GIANT Company, Hannaford and Stop & Shop. Ahold Delhaize USA associates support the brands with a wide range of services, including Finance, Legal, Sustainability, Commercial, Digital and E-commerce, Technology and more.
Overview
The Platform Reliability Engineer will help ensure service availability, identify and automate manual processes, and bridge the gaps between product development teams and operations. Implementing operational improvements in availability, latency, performance, efficiency, change management, monitoring, incident response, patch management, and capacity planning is within scope for this role. Whether through code, modern tools, or better processes, the goal is continuous improvement and efficiency.
You’ll drive operational excellence through strong troubleshooting skills and ownership in supporting a range of Azure services. The role also requires managing Linux servers in the cloud, so the candidate must have a strong Linux background.
Our flexible/hybrid work schedule includes 3 in-person days at one of our core locations and 2 remote days. Our core office locations are Salisbury, NC & Quincy, MA.
Applicants must be currently authorized to work in the United States on a full-time basis.
Responsibilities
- Builds, manages, and operates Azure Core Services with automation and infrastructure as code
- Manages and operates the continuous delivery framework and tools; manages and automates the lifecycle of platform components and supports product teams
- Leverages cloud architecture, applying site reliability principles and full‑stack troubleshooting skills across network, application, security, identity, OS (including Linux), containers, on‑prem, and distributed services
- Provides reasoning about system & application architecture; reviews code to improve reliability
- Identifies automation opportunities to improve patching, service health, manageability, reliability, and telemetry
- Owns, triages, investigates, and resolves service issues with a focus on communication, learning & teaching
- Designs process or technology solutions that monitor, identify, and resolve system and deployment issues pre‑ and post‑production, ensuring measurable KPI improvements
- Drives security and compliance for services in accordance with Azure compliance requirements
- Engages in service capacity planning, forecasting, and cost optimization
- Creates and documents runbooks, operational procedures, and standards in Confluence
- Communicates at a deep technical level with engineering, PM, and product teams
- Works within agile/scrum project teams in a support role
- Remains current on new technologies, methods, coding practices, TDD, CI/CD, and operational excellence
- Manages, supports, and troubleshoots Linux servers and Linux-based workloads running in cloud environments
- Implements and automates Linux system operations, patching, performance tuning, and hardening