Site Reliability Engineer

Job in Alpharetta, Fulton County, Georgia, 30239, USA
Listing for: Russell Tobin
Part Time, Contract position
Listed on 2026-05-30
Job specializations:
  • IT/Tech
    Systems Engineer, SRE/Site Reliability, Cybersecurity, IT Support
Salary/Wage Range or Industry Benchmark: 74 - 78 USD Hourly
Job Description & How to Apply Below

Location: Alpharetta, GA (Hybrid 3 Days a Week Onsite)

Duration: 12 Months Contract + Possible Extension

Pay rate: $74 - $78 per hour

Job Description

The Site Reliability Engineer will support Cyber Data Risk & Resilience by ensuring the reliability, availability, performance, and operational visibility of critical cybersecurity platforms and services.

This role is responsible for keeping production systems running, instrumenting infrastructure and application layers, building meaningful monitoring and actionable alerting, supporting incident response, and continuously improving dashboards used by engineering, operations, risk, and executive stakeholders.

Responsibilities
  • Maintain and improve the reliability, availability, scalability, and performance of cybersecurity platforms, services, and supporting infrastructure
  • Support day‑to‑day operational stability by monitoring system health, identifying risks, responding to incidents, and driving timely resolution of service‑impacting issues
  • Instrument infrastructure, applications, services, APIs, data pipelines, and cloud components to provide end‑to‑end visibility into system behavior and service health
  • Design, build, and continuously refine monitoring, alerting, logging, tracing, and observability capabilities across distributed systems and cloud environments
  • Develop meaningful and actionable alerts that reduce noise, improve signal quality, and enable teams to respond quickly to emerging issues
  • Define and track key reliability metrics, including availability, latency, throughput, error rates, saturation, service‑level indicators, service‑level objectives, and operational risk indicators
  • Build, maintain, and enhance dashboards for engineering, operations, product, risk, and executive stakeholders, ensuring information is accurate, timely, and decision‑ready
  • Continuously modify and improve executive dashboards to support regular leadership reviews of service health, reliability trends, incidents, risks, and operational performance
  • Partner with engineering, cybersecurity, infrastructure, cloud, and application teams to identify reliability gaps and implement long‑term improvements
  • Participate in incident response, root‑cause analysis, problem management, and post‑incident reviews to prevent recurrence and improve operational maturity
  • Automate operational tasks, health checks, reporting, deployment validation, and recovery procedures to improve efficiency and reduce manual effort
  • Collaborate with application and platform teams to embed reliability, monitoring, and supportability requirements into the software development lifecycle
  • Support CI/CD, DevOps, and release management practices by validating operational readiness, monitoring coverage, rollback plans, and production support requirements
  • Contribute to resiliency engineering efforts, including capacity planning, performance tuning, failover validation, disaster recovery readiness, and chaos/resilience testing where applicable
  • Ensure monitoring, alerting, dashboards, and operational processes align with enterprise security, risk, compliance, and governance standards
Required Qualifications
  • 10+ years of experience in site reliability engineering, systems engineering, software engineering, DevOps, infrastructure engineering, or production operations
  • Strong experience supporting highly available, distributed, cloud‑based, or mission‑critical technology platforms
  • Hands‑on experience with observability practices, including monitoring, alerting, logging, metrics, tracing, dashboards, and service health reporting
  • Experience instrumenting applications, services, APIs, infrastructure, databases, and cloud components to enable end‑to‑end operational visibility
  • Strong understanding of reliability engineering concepts, including SLIs, SLOs, SLAs, error budgets, incident management, capacity management, and operational readiness
  • Experience designing actionable alerts that support rapid issue detection, triage, escalation, and resolution
  • Experience building and maintaining operational dashboards for technical teams

Russell Tobin is an equal opportunity employer. We do not discriminate on the basis of race, religious creed, color, national origin, ancestry, physical disability, mental disability, reproductive health decision making, medical condition, genetic information, marital status, sex, gender, gender identity, gender expression, age, sexual orientation, veteran or military status, or any other characteristic protected by applicable federal, state, or local law.

Russell Tobin is a Fair Chance employer. We consider all qualified applicants, including those with criminal histories, in a manner consistent with applicable state and local Fair Chance laws and ordinances, including the California Fair Chance Act and all applicable local Fair Chance ordinances.
