Site Reliability Engineering Manager
Listed on 2025-12-01
IT/Tech
Systems Engineer, SRE/Site Reliability, Cloud Computing, IT Project Manager
IT Recruiter at Trustech Inc, dad, husband, and expert at turning off lights and closing drawers around the house
NOTE: Candidates requiring sponsorship now or in the future (including CPT/OPT) cannot be considered for this job
No C2C
Candidates will be required to work on site 3 days per week in south Salt Lake County
Onsite interviews are required. Local candidates only
Overview
We are seeking a hands‑on Platform Engineering/SRE Manager to lead a small, high‑impact team responsible for maintaining and improving the reliability, performance, and scalability of our production systems. This role blends technical leadership and operational excellence, managing a group of Site Reliability and Platform Engineers who ensure our applications and infrastructure run smoothly in production.
The ideal candidate is a player‑coach, comfortable leading incident response efforts, mentoring engineers, and still contributing technically through infrastructure automation, observability improvements, and system reliability enhancements.
Key Responsibilities
- Lead and mentor a team of SREs and Platform Engineers (currently five members) focused on production stability, system automation, and operational readiness.
- Own the reliability lifecycle, driving proactive monitoring, on‑call response leadership, and post‑incident reviews to minimize downtime and improve service quality.
- Develop and evolve infrastructure automation using Terraform, Helm, and related Infrastructure‑as‑Code practices to standardize deployments and reduce manual interventions.
- Partner with product, software, and operations teams to implement scalable cloud solutions that meet performance and resiliency targets.
- Oversee observability and telemetry using tools like Grafana, Azure Insights, Datadog, or Dynatrace, ensuring comprehensive visibility into system health.
- Drive the definition and tracking of SLOs, SLIs, and SLAs, helping teams measure and continuously improve reliability standards.
- Collaborate with engineering leads to enhance developer platform capabilities such as workflow automation, CI/CD pipeline management, and simplified environment provisioning.
Qualifications
- Bachelor’s degree in Computer Science, Information Technology, or equivalent practical experience.
- 7+ years in infrastructure, SRE, or platform engineering roles, including 3+ years in leadership or team management.
- Strong background in cloud infrastructure (AWS, Azure, or GCP) and hands‑on experience with IaC tools such as Terraform.
- Familiarity with CI/CD pipelines, container orchestration, and deployment frameworks (e.g., Jenkins, GitHub Actions, Kubernetes, Docker).
- Experience improving system observability, developing dashboards, and managing alerting systems using Grafana or similar platforms.
- Competence in Python, Go, or C# for automation and troubleshooting.
- Solid understanding of relational databases (SQL) and the ability to guide teams in identifying and resolving performance bottlenecks.
- Demonstrated ability to lead incident management, communicate effectively across teams, and create a culture of continuous improvement.
- Experience with developer enablement or internal platform engineering initiatives (e.g., self‑service infrastructure or environment provisioning).
- Familiarity with data‑driven operational metrics and applying analytics to improve system reliability.
- Prior experience managing a hybrid or remote technical team across time zones.
- Approximately 30% hands‑on technical contribution and 70% team leadership, process improvement, and coordination.
- Availability to participate in daytime and occasional off‑hours on‑call support rotations.
- Commitment to building a proactive, reliability‑first culture that values automation, transparency, and cross‑functional collaboration.
- Mid‑Senior level
- Full‑time
- Information Technology
- IT Services and IT Consulting