Lead Site Reliability Engineer Job, Atlanta area, Georgia, USA, IT/Tech

Overview

We are seeking a Lead Site Reliability Engineer to spearhead our SRE team. You are not just an operator; you are an experienced software engineer who excels at architecture, code optimization, and deep troubleshooting. In this role, you will drive operational maturity by defining our reliability standards (SLOs), hardening our security posture (WAF/Infra Sec), and scaling the Intellum platform.

Our stack

Core:
Applications written in Ruby on Rails and Node.js, PostgreSQL, MongoDB, Redis, Memcached, Sidekiq, Active Job, Elasticsearch, WebSockets
Infrastructure:
100% Linux-based cloud infrastructure (AWS, Google Cloud, MongoDB Atlas) and services (ECS/EC2/Kubernetes, ElastiCache, Memorystore, RDS, Cloud SQL, BigQuery, etc.)
Infrastructure as Code (IaC):
GitHub, Terragrunt, Terraform, Ansible
CI/CD:
Spinnaker, Jenkins
Observability & Alerting:
New Relic, AWS CloudWatch, Google Cloud Stackdriver, Squadcast
Agile/Scrum practices utilizing JIRA

Responsibilities

SRE Leadership & Strategy:
Set clear goals for the SRE team and partner with Engineering leadership to align platform initiatives with business objectives
Reliability & Observability (SLA/SLO):
Lead the definition and enforcement of SLAs, SLIs, and SLOs. Architect observability frameworks to translate telemetry data into actionable roadmaps that reduce toil and enhance resilience
Core Engineering & Performance:
Take ownership of critical code components (e.g., Queues, Enrollments) and lead efforts to identify bottlenecks, optimize performance, and improve code quality across the engineering department
Security by Design:
Champion infrastructure security. Partner with InfoSec to define hardening standards, manage perimeter defense (WAF/DDoS), and automate vulnerability remediation within the CI/CD pipeline
Incident Command:
Participate in the 24x7 on-call rotation and lead post-incident reviews (RCAs), ensuring action items are implemented to improve MTTR and prevent recurrence
Mentorship:
Empower developers with better tooling and guidance on performant coding practices, fostering a culture of collaboration, reliability, and "you build it, you run it" ownership

Required Skills & Experience

10+ years of engineering experience, with 5+ years specifically developing Ruby on Rails applications
Expertise in Cloud Computing (AWS/GCP) and Infrastructure as Code (Terraform/Ansible)
Strong proficiency with SQL databases (PostgreSQL) and the ability to quickly navigate and optimize complex, unfamiliar codebases

Additional SRE & Operations

Deep Observability:
Proven experience designing monitoring solutions (Datadog, New Relic, Prometheus) based on the "Golden Signals"
SLO Governance:
Demonstrated ability to define SLIs/SLOs from scratch, negotiate Error Budgets, and use data to balance feature velocity with reliability
Security Focus:
Experience securing cloud environments and container platforms (Kubernetes), including hands-on management of WAF rules and edge security
Incident Management:
Experience leading post-incident reviews (RCAs) and implementing action items that directly improve MTTR and MTTD

Leadership & Collaboration

Proven experience leading technical teams, mentoring engineers, and working in a team-oriented, collaborative environment with strong communication skills
Documentation & Training:
Skilled in documenting solutions and training operational teams on how to effectively support and maintain systems
Proactive Problem-Solving:
Demonstrated ability to communicate clearly, seek help proactively, and take ownership of tasks to completion

Bonus Skills

Automation Tools:
Experience developing solutions with infrastructure automation tools such as Terraform and Ansible
CI/CD Expertise:
Experience in writing and maintaining CI/CD pipelines and services
Kubernetes:
Experience in building, deploying, and optimizing Kubernetes-based infrastructure
Perimeter Defense:
Experience configuring and managing Web Application Firewalls (WAF) and DDoS protection mechanisms

Education

Bachelor’s degree in Computer Science or related technical field
