
Site Reliability Engineer (SRE)

Job in Alpharetta, Fulton County, Georgia, 30239, USA
Listing for: Sierra Business Solution LLC
Full Time position
Listed on 2026-06-09
Job specializations:
  • IT/Tech
    Cloud Computing, Systems Engineer, SRE/Site Reliability, IT Support
Job Description & How to Apply Below
Position: Site Reliability Engineer (SRE)

Skill Set - Expertise in UNIX/Linux administration, AWS/Azure cloud monitoring, Terraform/Ansible, and Prometheus/Grafana observability.

Work Location - Alpharetta

Experience required for role - 6+ years

Production experience in SRE/infrastructure ops for large-scale systems

Strong programming/scripting skills (Python, Go, Java, or equivalent)

Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)

Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)

Familiarity with GPU/AI compute clusters, high-performance data storage, and distributed architectures

Experience with monitoring, observability, logging, and alerting tools (Prometheus, Grafana, ELK/EFK, Datadog, etc.)

Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)

Solid experience in capacity planning, performance tuning, scaling, and incident response

Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements

Experience in regulated environments (financial services, compliance, audit, security) is a strong plus

Excellent communication, documentation, and cross-team collaboration skills

Proven track record of reducing operational toil via automation

Experience:

6+ years of experience as a Site Reliability Engineer or in a similar role, with hands-on experience supporting IaaS platforms and solid networking and systems engineering knowledge.

Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)

Design and build automation for core platform capabilities, reducing manual toil

Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes container orchestration, etc.

Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards

Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation

Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting

Optimize cost vs. performance tradeoffs in large-scale compute environments

Harden systems for security, compliance, auditability, and data governance

Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems

Define disaster recovery (DR) strategies, backup/restore practices, and fault-tolerance mechanisms

Maintain runbooks, operational playbooks, documentation, and training materials

Participate in on-call rotations and respond to production incidents 24/7 as needed

Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability

Skills:

Digital: Python
Digital: Docker
Digital: Kubernetes
Digital: Site Reliability Engineering (SRE)

Experience Required: 6-8 years
