Senior Site Reliability Engineer
Listed on 2026-01-26
IT/Tech
Systems Engineer, SRE/Site Reliability, Cloud Computing, IT Support
Bolt On Technology is a leading provider of software solutions for the automotive aftermarket, empowering thousands of repair shops across North America with tools that streamline operations, enhance customer communication, and drive business growth. Their award-winning platform enables digital vehicle inspections, automated messaging, scheduling, payments, and more, helping shops boost revenue, increase efficiency, and build stronger customer relationships. Bolt On is committed to transforming the vehicle service industry with innovative technology and exceptional support.
Bolt On is part of the Performant Capital portfolio. Performant Capital is a Chicago-based private equity firm that partners with mission-critical software and tech-enabled services companies to accelerate growth and operational excellence. With deep expertise in SaaS and technology investments, Performant works closely with leadership teams to drive innovation, scale products, and expand market reach.
The Senior Site Reliability Engineer is responsible for ensuring the reliability, scalability, performance, and security of our production systems. This role blends software engineering and systems engineering to build resilient infrastructure, improve automation, and proactively reduce operational risk. The Senior SRE will serve as a technical leader, driving best practices across observability, incident response, and platform stability.
Key Responsibilities
- Design, build, and maintain highly available, scalable, and fault-tolerant systems
- Lead reliability improvements across production and non-production environments
- Own and improve monitoring, alerting, and observability platforms
- Drive incident response, root cause analysis, and post-incident reviews
- Implement automation to reduce manual operational work
- Partner with Engineering, Security, and Product to support platform needs
- Establish and track SLIs, SLOs, and error budgets
- Lead capacity planning and performance tuning efforts
- Improve deployment, CI/CD, and infrastructure-as-code practices
- Identify and mitigate reliability and scalability risks before they impact customers
- Mentor and guide junior engineers and contribute to team technical standards
- Participate in on-call rotation and help mature on-call processes
Qualifications
- 6+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or related roles
- Strong experience with cloud platforms (AWS, Azure, or GCP)
- Proficiency with infrastructure as code (Terraform, CloudFormation, Pulumi, etc.)
- Experience with containerization and orchestration (Docker, Kubernetes)
- Strong Linux systems administration and networking fundamentals
- Experience building and maintaining CI/CD pipelines
- Hands-on experience with monitoring and observability tools (Datadog, Prometheus, Grafana, New Relic, etc.)
- Strong troubleshooting and incident management skills
- Experience with scripting and automation (Python, Bash, Go, or similar)
Preferred Qualifications
- Experience designing multi-region or highly distributed systems
- Experience with security best practices and compliance in production environments
- Experience supporting high-availability SaaS platforms
- Experience in a fast-growing or PE-backed environment
- Experience influencing reliability culture across engineering teams
Key Attributes
- Ownership and accountability
- Strong communication during incidents and escalations
- Bias toward automation and continuous improvement
- Ability to balance speed and stability
- Mentorship and technical leadership
- Calm under pressure
Success Measures
- Improved system uptime and reduced incident frequency
- Faster incident detection and resolution times
- Increased automation and reduced manual operational work
- Clear SLOs and reliability metrics adopted by engineering teams
- Strong cross-team trust in platform stability