Site Reliability Engineer Job, Alpharetta area, Georgia, USA, IT/Tech

Role Overview

The Site Reliability Engineer will support Cyber Data Risk & Resilience by ensuring the reliability, availability, performance, and operational visibility of critical cybersecurity platforms and services. This role is responsible for keeping production systems running, instrumenting infrastructure and application layers, building meaningful monitoring and actionable alerting, supporting incident response, and continuously improving dashboards used by engineering, operations, risk, and executive stakeholders.

Responsibilities

Maintain and improve the reliability, availability, scalability, and performance of cybersecurity platforms, services, and supporting infrastructure
Support day-to-day operational stability by monitoring system health, identifying risks, responding to incidents, and driving timely resolution of service-impacting issues
Instrument infrastructure, applications, services, APIs, data pipelines, and cloud components to provide end-to-end visibility into system behavior and service health
Design, build, and continuously refine monitoring, alerting, logging, tracing, and observability capabilities across distributed systems and cloud environments
Develop meaningful and actionable alerts that reduce noise, improve signal quality, and enable teams to respond quickly to emerging issues
Define and track key reliability metrics, including availability, latency, throughput, error rates, saturation, service-level indicators, service-level objectives, and operational risk indicators
Build, maintain, and enhance dashboards for engineering, operations, product, risk, and executive stakeholders, ensuring information is accurate, timely, and decision‑ready
Continuously modify and improve executive dashboards to support regular leadership reviews of service health, reliability trends, incidents, risks, and operational performance
Partner with engineering, cybersecurity, infrastructure, cloud, and application teams to identify reliability gaps and implement long-term improvements
Participate in incident response, root‑cause analysis, problem management, and post‑incident reviews to prevent recurrence and improve operational maturity
Automate operational tasks, health checks, reporting, deployment validation, and recovery procedures to improve efficiency and reduce manual effort
Collaborate with application and platform teams to embed reliability, monitoring, and supportability requirements into the software development lifecycle
Support CI/CD, DevOps, and release management practices by validating operational readiness, monitoring coverage, rollback plans, and production support requirements
Contribute to resiliency engineering efforts, including capacity planning, performance tuning, failover validation, disaster recovery readiness, and chaos/resilience testing where applicable
Ensure monitoring, alerting, dashboards, and operational processes align with enterprise security, risk, compliance, and governance standards

Required Qualifications

7 to 10+ years of experience in site reliability engineering, systems engineering, software engineering, DevOps, infrastructure engineering, or production operations
Strong experience supporting highly available, distributed, cloud‑based, or mission‑critical technology platforms
Hands‑on experience with observability practices, including monitoring, alerting, logging, metrics, tracing, dashboards, and service health reporting
Experience instrumenting applications, services, APIs, infrastructure, databases, and cloud components to enable end‑to‑end operational visibility
Strong understanding of reliability engineering concepts, including SLIs, SLOs, SLAs, error budgets, incident management, capacity management, and operational readiness
Experience designing actionable alerts that support rapid issue detection, triage, escalation, and resolution
Experience building and maintaining operational dashboards for technical teams, support teams, and senior/executive stakeholders
Strong scripting or programming skills using Python, Java, Bash, PowerShell, or similar languages for automation and operational tooling
Experience with cloud platforms such as AWS, Azure, or…