Site Reliability Principal Specialist, IT Operations

Job in Ottawa, Ontario, Canada
Listing for: Sherweb Inc.
Full Time position
Listed on 2026-06-04
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, SRE/Site Reliability, IT Support
Job Category: Cloud and Systems Infrastructure

Requisition Number: SITER

  • Posted: March 12, 2026
  • Full-Time
  • Remote
Locations

Canada

Description

Location:

Remote (from Canada)

Here’s what we do and why we do it

We work to simplify the cloud for IT professionals so they can focus on what really matters: making their customers’ lives better. Find out how we do that here:

Overview

The Site Reliability Principal Specialist on the IT Operations team implements a proactive, resilient, and scalable approach to site reliability across all Sherweb platforms.

This is a senior technical individual contributor position responsible for shaping how reliability is designed, governed, and sustained across systems. The role elevates reliability from reactive operations to an engineered discipline—intentional, measurable, and scalable—ensuring platforms operate predictably as Sherweb grows in scale, complexity, and customer impact. As Sherweb continues to expand its platforms and global customer footprint, reliability becomes a core business capability.

Operating at a broad organizational scope, this role acts as a principal-level technical leader across IT Operations. It sets reliability direction and drives consistency through technical authority, influence, and partnership. The role serves as a technical counterpart to senior engineering, infrastructure, and platform leaders to shape operational strategy across multiple teams.

Here’s how you will contribute to the success of the company

  • Define and evolve reliability standards across platforms and services, including service level objectives (SLOs) and service level indicators (SLIs), to improve mission-critical services.
  • Establish a shared reliability language and expectations across IT Operations teams.
  • Drive consistency in monitoring and operational practices across services, systems and platforms.
  • Influence system and operational design to improve reliability, availability and resilience.
  • Drive the reduction of operational toil through automation, AI, platform capabilities, and repeatable operational patterns.
  • Improve end-to-end observability and system understanding, including logging, metrics, tracing, and telemetry across systems, so teams can reason clearly about system behavior and failure modes.
  • Enable teams to take end-to-end ownership of platform reliability, including deeper investigation across infrastructure and application layers.
  • Partner closely with infrastructure and platform teams to ensure access, tooling, and visibility support full operational ownership and to drive reliability improvements.
  • Act as a reliability advocate and technical advisor during operational reviews, incident learning, and platform evolution.
  • Partner closely with DevOps teams to implement reliability and observability as code, ensuring integration with CI/CD pipelines and platform tooling.

Here’s what you need to have and master to get the job

Education

  • Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related field, or equivalent practical experience.

Experience

  • 10+ years of experience in Site Reliability Engineering, operating and improving large-scale production environments.
  • Demonstrated experience improving the reliability, availability, and scalability of production systems, platforms and services.
  • Hands-on experience operating distributed systems in business-critical and customer-facing environments.
  • Proven experience reducing manual operational work through automation and standardization.
  • Experience defining and applying reliability standards (e.g., SLOs, error budgets) across multiple services or platforms.
  • Demonstrated ability to influence technical direction across multiple teams without direct authority.

Core skills

  • Strong understanding of distributed systems, failure modes, and operational resilience.
  • Solid experience with observability practices (metrics, logs, traces) and system diagnostics.
  • Ability to analyze complex systems end-to-end across infrastructure, platform, and application layers.

Technical leadership

  • Strong systems thinking with a track record of addressing reliability…