Site Reliability Lead Engineer Job, Fort Mill area, South Carolina, USA, IT/Tech

Summary

A senior technical leader responsible for owning a reliability strategy, leading an SRE team, and ensuring the operational health, scalability, and availability of services. Combines hands-on engineering, automation, and people leadership to drive reliability across the organization.

Core responsibilities

Strategy & process

Define SRE strategy, process frameworks, standards, and best practices.
Establish SLIs, SLOs, and error budget policies; embed reliability into the SDLC.
Promote a culture of service ownership and maintain strong cross-team feedback loops.

Reliability & capacity

Oversee monitoring and maintenance to meet SLAs and uptime targets.
Drive capacity planning and forecasting to ensure performance at scale.
Use data and metrics to prioritize reliability investments and tradeoffs.

Automation & tooling

Lead automation efforts to eliminate operational toil and streamline runbooks.
Oversee Infrastructure as Code practices (for example Terraform, CloudFormation) and configuration management.
Improve CI/CD pipelines to enable safer, faster releases.

Incident & change management

Lead incident response and communications during outages.
Conduct blameless postmortems and ensure corrective actions are executed.
Govern change control to ensure safe, tested production deployments.

Collaboration & communication

Partner with engineering, architecture, and product teams to bake reliability into designs and roadmaps.
Translate technical issues and tradeoffs for technical and nontechnical stakeholders.

Team leadership

Hire, mentor, and develop SRE engineers; set team goals and a roadmap.
Lead calmly and effectively under pressure during critical incidents and drive customer-focused decisions.

Qualifications & skills

Technical

Proven SRE/DevOps/infrastructure experience (typically 6 years) with leadership experience (about 2 to 3 years).
Strong cloud experience (AWS preferred), containerization (Docker), and orchestration (Kubernetes).
Expertise with IaC and automation tools (Terraform, CloudFormation, Ansible, or similar).
Proficient in scripting and programming for automation (Python, Bash, or similar).
Deep experience with monitoring and observability tooling (Prometheus, Grafana, ELK Stack, Splunk, Datadog, etc.).

Leadership & soft skills

Strong people leadership and coaching skills with proven stakeholder communication.
Excellent problem-solving, analytical thinking, and adaptability.
Strategic mindset balancing engineering excellence with business priorities.

Deliverables

A measurable reliability roadmap aligned to business goals.
Reduced operational toil through automation and improved runbooks.
Clear SLIs and SLOs, with established error budget governance.
A high-performing SRE team with documented processes for incident and change management.
