Senior Site Reliability Engineer Job, Atlanta area, Georgia, USA, IT/Tech

Senior Site Reliability Engineer

About the Role

The Site Reliability Engineering team at Todyl exists to make our platform reliable, secure, and easy for engineering teams to ship to. We do that by building automation, self-service tooling, and operational standards that let developers move fast without putting customers at risk.

Our success is measured by how much production reliability and developer velocity we enable, not by how much work flows through us. This is a senior individual contributor role. You'll own end-to-end design and delivery of the Kubernetes-based platform initiatives that shape how Todyl runs production over the next 2-3 years, mentor and uplevel the rest of the SRE team, and operate as a peer to Architecture and Security on high-stakes platform decisions.

The team is small and rebuilding after recent transitions, and you'll work alongside our Principal SRE as one of the senior anchors of the function.

In this role, we're looking for someone who:

* Has 5+ years of Site Reliability Engineering or platform-engineering experience and has owned major platform initiatives end-to-end, from design through stabilization, staying with the work until it's truly done rather than declaring victory. They're recognized as the go-to person in their technical domain and create design documentation that their teams reference long after the work ships.

* Mentors less-tenured engineers as a matter of practice. They grow the people around them through pairing, design partnership, and the example they set.

* Sees SRE as a service to the engineering organization, not a gate. They build trust with developers and make other teams' jobs easier.

* Treats security as a normal part of operating the platform, not an afterthought, and brings demonstrated experience designing systems with security as a first-class concern.

* Gets energized by eliminating toil. They look at repetitive work and ask, "How do we make this go away?"

* Actively uses AI tooling in their day-to-day work, and influences how the team adopts AI patterns safely.

* Can communicate technical decisions clearly to engineers, engineering leadership, and non-engineering stakeholders, and is comfortable saying no or pushing back constructively when it matters.

What you'll do:

* Own end-to-end design and delivery of flagship platform initiatives, designing for failure modes, graceful degradation, and the scale we expect 12 months from now rather than just today. The headline 12-18 month deliverable for this role is the golden-path platform: a developer-facing self-service path to production that enforces infrastructure best practices without requiring SRE involvement.

* Drive security automation at platform scale, including patching cadence, secret rotation, access controls, and CVE remediation, as ongoing operational practices rather than reactive sprints.

* Partner with product engineering teams at the architecture phase of high-stakes systems, helping shape the design rather than reviewing it the week before launch.

* Operate as a peer to Architecture and Security on platform decisions that affect how Todyl runs production over the next 2-3 years.

* Mentor less-tenured SREs through pairing, code review, and design partnership, with measurable improvement in their autonomy on design and incident work.

* Contribute to one or more SRE practice improvements adopted by the team: incident commander discipline, postmortem maturity, change management standards, on-call quality, or design review cadence.

* Build and operate the production platform: Kubernetes with Helm and ArgoCD, CI/CD pipelines, infrastructure-as-code (Terraform, Salt), observability (Grafana, Prometheus), secrets management, and AWS (including EKS). We're shifting from reactive to proactive, and we'd rather build guardrails than approve every deploy.

* Drive cost visibility and efficiency across our cloud footprint, including AWS resource tagging, COGS attribution, and right-sizing across the platform, and quantify the business impact in terms that leadership can act on.

* Participate in a weekly on-call rotation, resolve most issues independently, and own postmortems and follow-up actions for the incidents you respond to.

* Plan and estimate honestly, break multi-quarter work into smaller increments, communicate delays early, and write tests for the automation you build because it runs in production.

* Treat code review as a quality lever, not a checkbox. Catch missing tests, push back on tech debt, and watch dashboards and logs to verify your own changes after they ship.

* Look for ways to hand off what you've built, or make it self-managing, once it is mature and stable, rather than holding onto it forever.

Important note:

We expect the person in this role to actively use AI tools, including tools like Claude, to accelerate automation development, reduce toil, and solve infrastructure problems more quickly. At the senior level, we also expect you to influence how the team adopts AI tooling:…