Site Reliability Principal Specialist, IT Operations
Listed on 2026-06-04
IT/Tech
Systems Engineer, SRE/Site Reliability
Here’s what we do and why we do it:
We work to simplify the cloud for IT professionals so they can focus on what really matters: making their customers’ lives better.
Location:
Remote (from Canada).
The Site Reliability Principal Specialist on the IT Operations team implements a proactive, resilient, and scalable approach to site reliability across all Sherweb platforms. This senior technical individual contributor role shapes how reliability is designed, governed, and sustained across systems. It elevates reliability from reactive operations to an engineered discipline that is intentional, measurable, and scalable, so that platforms operate predictably as Sherweb grows in scale, complexity, and customer impact.
Operating at a broad organizational scope, this role acts as a principal-level technical leader across IT Operations. It sets reliability direction and drives consistency through technical authority, influence, and partnership, serving as a technical counterpart to senior engineering, infrastructure, and platform leaders to shape operational strategy across multiple teams.
- Define and evolve reliability standards across platforms and services, including SLOs and SLIs, to improve mission‑critical services.
- Establish a shared reliability language and expectations across IT Operations teams.
- Drive consistency in monitoring and operational practices across services, systems and platforms.
- Influence system and operational design to improve reliability, availability and resilience.
- Drive the reduction of operational toil through automation, AI, platform capabilities, and repeatable operational patterns.
- Improve end‑to‑end observability and system understanding, enabling teams to reason clearly about system behavior and failure modes. Improve logging, metrics, tracing, and telemetry across systems.
- Enable teams to take end‑to‑end ownership of platform reliability, including deeper investigation across infrastructure and application layers.
- Partner closely with infrastructure and platform teams to ensure access, tooling, and visibility support full operational ownership and to drive reliability improvements.
- Act as a reliability advocate and technical advisor during operational reviews, incident learning, and platform evolution.
- Partner closely with DevOps teams to implement reliability and observability as code, ensuring integration with CI/CD pipelines and platform tooling.
Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related field, or equivalent practical experience.
Experience:
- 10+ years of experience in Site Reliability Engineering, operating and improving large‑scale production environments.
- Demonstrated experience improving the reliability, availability, and scalability of production systems, platforms and services.
- Hands‑on experience operating distributed systems in business‑critical and customer‑facing environments.
- Proven experience reducing manual operational work through automation and standardization.
- Experience defining and applying reliability standards (e.g., SLOs, error budgets) across multiple services or platforms.
- Demonstrated ability to influence technical direction across multiple teams without direct authority.
- Strong understanding of distributed systems, failure modes, and operational resilience.
- Solid experience with observability practices (metrics, logs, traces) and system diagnostics.
- Ability to analyze complex systems end‑to‑end across infrastructure, platform, and application layers.
- Strong systems thinking with a track record of addressing reliability issues through design rather than reactive intervention.
- Experience acting as a trusted technical advisor to senior engineers and leaders.
- Ability to clearly communicate complex reliability concepts to both technical and nontechnical stakeholders.
- Cloud platform: Microsoft Azure Solutions Architect Expert or DevOps Engineer Expert would be strong assets.
- Certifications related to reliability, operations, or systems engineering (e.g., Kubernetes, Linux, or observability platforms) are considered an asset.
- Equivalent…