Principal Site Reliability Engineer

Job in Quincy, Norfolk County, Massachusetts, 02171, USA
Listing for: ViziRecruiter,LLC.
Full Time position
Listed on 2025-11-27
Job specializations:
  • IT/Tech
    Systems Engineer, SRE/Site Reliability, Cloud Computing, IT Support
Salary/Wage Range or Industry Benchmark: 146,960 - 220,440 USD Yearly
Job Description & How to Apply Below

Introduction

Ahold Delhaize USA, a division of global food retailer Ahold Delhaize, is part of the U.S. family of brands, which also includes five leading omnichannel grocery brands – Food Lion, Giant Food, The GIANT Company, Hannaford and Stop & Shop. Ahold Delhaize USA associates support the brands with a wide range of services, including Finance, Legal, Sustainability, Commercial, Digital and E-commerce, Technology and more.

Overview

The Site Reliability Engineer (SRE) IV is a senior technical leader responsible for designing, guiding, and scaling site reliability engineering practices across complex, distributed systems. This role plays a crucial part in driving operational excellence, ensuring system resiliency, and fostering a high-performing engineering culture. The SRE IV works closely with senior leadership, engineering, and product teams to set strategic goals around availability, performance, and incident response while leading large-scale reliability initiatives.

This position emphasizes deep technical expertise in platforms such as Spring Boot, Java, Tomcat, Redis, and Kafka, along with infrastructure tooling like AKS, Kubernetes, ArgoCD, Terraform, and GitHub Actions, and observability platforms like Datadog. The ideal candidate will also bring strong experience working with Ubuntu/Linux environments, containerization with Docker, and automation of operational workflows across a modern DevOps toolchain.

Our flexible/hybrid work schedule includes 3 in-person days at our Chicago, IL office and 2 remote days.

Applicants must be currently authorized to work in the United States on a full-time basis.

Responsibilities
  • Architect, evolve, and lead implementation of enterprise-level SRE frameworks, tools, and cloud-native reliability strategies.
  • Build, scale, and manage microservices platforms using Spring Boot, Java, Tomcat, and Redis with Kubernetes and AKS.
  • Lead technical design reviews, chaos testing, and infrastructure planning with an emphasis on scalability, high availability, and fault tolerance.
  • Define, implement, and refine SLOs/SLIs and operational health indicators for business-critical services.
  • Automate infrastructure provisioning and application deployment workflows using Terraform, GitHub Actions, and ArgoCD.
  • Drive observability and telemetry adoption using Datadog, including dashboards, alerts, custom metrics, and distributed tracing.
  • Act as incident commander during critical production issues; conduct blameless postmortems and guide root cause remediation.
  • Lead cross-team efforts to reduce mean time to detect (MTTD) and mean time to resolve (MTTR), and to promote self-healing systems.
  • Partner with security and compliance teams to ensure that systems are secure, auditable, and operationally compliant.
  • Enhance service resiliency through strategies including Kafka-based event-driven architecture, retries, rate limiting, and circuit breakers.
  • Mentor junior SREs and engineers, lead technical communities of practice, and promote a culture of continuous improvement.
  • Maintain and improve Ubuntu-based production systems and containerized workloads with Docker.
  • Evaluate and integrate emerging DevOps technologies to support scalability and reliability objectives.

Requirements
  • Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field; equivalent practical experience may be considered.
  • 8+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering roles in large-scale production environments.
  • Expertise in building and maintaining Java-based microservices using Spring Boot, Tomcat, and Redis in containerized deployments.
  • Strong hands-on experience with Kubernetes, AKS, and ArgoCD for orchestration and GitOps deployment workflows.
  • Proficiency in Python, Java, Bash, or Go for automation, scripting, and infrastructure tooling.
  • Proven ability to implement observability platforms and practices using Datadog (metrics, logs, traces, dashboards, alerts).
  • Advanced experience working with CI/CD pipelines using GitHub and GitHub Actions.
  • Deep understanding of networking, Linux (especially Ubuntu), distributed systems, and container security.
  • Experience operating message-driven architectures using Kafka, with an emphasis on throughput, retries, and resilience.
  • Solid knowledge of Terraform and infrastructure as code best practices.
  • Excellent communication, collaboration, and stakeholder alignment skills across engineering and business teams.

Salary Range: $146,960 - $220,440

Actual compensation offered to a candidate may vary based on their unique qualifications and experience, internal equity, and market conditions. Final compensation decisions will be made in accordance with company policies and applicable laws.
