Site Reliability Engineer Job, Decatur area, Georgia, USA, IT/Tech

Job Description

Position Summary

We are seeking a highly experienced Site Reliability Engineer (SRE) to help design, operate, and continuously improve a highly available, secure, and cost-efficient multi-cloud platform. This role is deeply hands‑on and execution‑focused, with a strong bias toward action, measurable outcomes, and incremental progress.

The ideal candidate has deep expertise in AWS and Azure, strong instincts for efficient architecture, and a passion for observability, automation, and cost optimization (FinOps). You will work in a high‑compliance environment, partnering closely with engineering, product, and operations teams to ensure platform health, reliability, and scalability.

This role also requires leadership through influence—guiding teams, setting technical direction, and raising operational maturity—while remaining directly engaged in delivery.

Essential Duties and Responsibilities

Reliability & Platform Health

Design, implement, and operate reliable, scalable, and resilient systems across AWS and Azure
Establish and improve SLOs, SLIs, error budgets, and incident response practices
Lead root cause analysis and drive corrective actions to prevent recurrence
Continuously improve platform uptime, performance, and operational maturity

Observability & Operational Excellence

Design and maintain best‑in‑class observability (metrics, logs, traces, alerting)
Ensure actionable alerts with low noise and high signal
Use data to identify reliability risks, performance bottlenecks, and efficiency opportunities
Drive incremental improvements using clear goals and measurable outcomes

FinOps & Cost Optimization (Top Priority)

Own and drive cloud cost optimization initiatives across AWS and Azure
Partner with engineering and leadership to align cost with business value
Implement cost visibility, forecasting, and accountability practices
Identify architectural and operational improvements that reduce waste without sacrificing reliability or security

Security & Compliance

Operate within highly regulated environments with strong security controls
Support compliance efforts (SOC 2 and NIST‑aligned practices; CJIS preferred)
Embed reliability and compliance requirements into platform design and operations
Partner with security teams to ensure secure‑by‑default systems

Automation & AI‑Enabled Efficiency

Automate operational workflows using Infrastructure as Code, CI/CD, and tooling
Leverage AI tools for analysis, incident investigation, cost insights, capacity planning, and operational efficiency
Continuously seek opportunities to move faster and smarter through automation and intelligent tooling

Leadership & Collaboration

Act as a technical leader and trusted partner to engineering teams
Guide and mentor others on reliability, observability, and cost‑efficient design
Influence architecture and operational decisions through data and collaboration
Drive initiatives end‑to‑end with accountability and ownership

Required Qualifications

Bachelor's Degree in Computer Science or Engineering
5+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering
Deep hands‑on experience operating production systems in AWS and Azure
Strong background in cloud architecture, reliability engineering, and automation
Proven experience with observability platforms (metrics, logging, tracing, alerting)
Demonstrated success driving cloud cost optimization / FinOps initiatives
Experience working in high‑compliance environments (SOC 2 required; CJIS a plus)
Strong scripting or programming skills (e.g., Python, Go, Bash, or similar)
Experience with Infrastructure as Code (e.g., Terraform, CloudFormation, ARM/Bicep)

Preferred Qualifications

CJIS compliance experience
Experience supporting SaaS platforms serving public sector or regulated customers
Exposure to multi‑region, high‑availability architectures
Experience implementing or maturing FinOps practices at scale
Prior experience mentoring or leading cross‑functional technical initiatives

What Success Looks Like

Platform reliability and performance improve measurably over time
Cloud costs are visible, controlled, and optimized without compromising outcomes
Incidents are fewer,…

