Site Reliability Engineer III - Eng Job, Alpharetta area, Georgia, USA, IT/Tech

Why UKG

At UKG, the work you do matters. The code you ship, the decisions you make, and the care you show to a customer all add up to real impact. Today, tens of millions of workers start and end their days with our workforce operating platform, helping people get paid, grow in their careers, and shape the future of their industries. That’s what we do.

We never stop learning, we never stop challenging the norm, we push for better, and we celebrate the wins along the way. Here, you’ll get flexibility that’s real, benefits you can count on, and a team that succeeds together. Because at UKG, your work matters—and so do you.

Site Reliability Engineer III

Site Reliability Engineers (SREs) at UKG are experienced individual contributors who apply software engineering principles to operational challenges across the full service lifecycle. In this role, you will proactively monitor system health, manage risk through SLOs and error budgets, lead incident response, and enable safe, rapid change—balancing reliability with delivery velocity. SREs at UKG are passionate about learning and evolving with modern technologies.

We strive to innovate and relentlessly improve the customer experience, with an “automate everything” mindset that enables services to be delivered with speed, consistency, and high availability.

About the Role and Job Responsibilities

Engage in and improve the lifecycle of services from conception to end‑of‑life, including system design reviews, capacity planning, and production readiness.
Contribute to standards and best practices for system architecture, service delivery, reliability, and automation, including definition and monitoring of service health indicators (latency, traffic, error rates, resource saturation); service level objectives (SLOs); and usage of error budgets to guide operational and delivery decisions.
Support service, product, and engineering teams by leveraging common tooling and frameworks to increase availability and improve incident detection and response.
Improve system performance, availability, and efficiency through automation, process refinement, post‑incident
