Site Reliability Engineering Lead Job, Halifax area, Nova Scotia, Canada, IT/Tech

Welcome to Haleon. We’re a purpose-driven, world-class consumer company putting everyday health in the hands of millions. In just three years since our launch, we’ve grown, evolved and are now entering an exciting new chapter – one filled with bold ambitions and enormous opportunity.

Our trusted portfolio of brands – including Sensodyne®, Panadol®, Advil®, Voltaren®, Theraflu®, Otrivin®, and Centrum® – leads in resilient and growing categories. What sets us apart is our unique blend of deep human understanding and trusted science.

Now it’s time to realise the full potential of our business and our people. We do this through our Win as One strategy. It puts our purpose – to deliver better everyday health with humanity – at the heart of everything we do. It unites us, inspires us, and challenges us to be better every day, driven by our agile, performance-focused culture.

Purpose of the Role:

As SRE Lead, a newly created role, you will shape the future of Site Reliability Engineering (SRE) within our Commercial Tech organization.

You will provide technical leadership and strategic direction in all aspects of site reliability engineering — from designing and implementing observability frameworks, automation, and incident response processes, to ensuring seamless delivery and stability of large-scale systems. You will play a pivotal role in shaping best practices, guiding cross-functional teams, and embedding reliability into every stage of the engineering lifecycle.

Role responsibilities:

This role will provide YOU the opportunity to lead key activities to progress YOUR career. These responsibilities include some of the following:

Drive reliability, scalability, and performance across critical technology platforms to ensure seamless digital experiences.

Lead the design and implementation of modern observability practices, with a particular focus on Datadog.

Act as a bridge between development and operations, championing automation, resilience engineering, and incident management.

Align reliability goals with business objectives while proactively identifying, troubleshooting, and resolving complex system issues.

Build customized dashboards and configure advanced alerts (multi‑condition, anomaly detection, composite monitors).

Use Application Performance Monitoring (APM) to trace distributed systems and implement log pipelines for troubleshooting.

Leverage Datadog APIs for automation and CI/CD integration; connect with cloud providers (AWS, Azure, GCP), containers (Kubernetes, Docker), and serverless functions.

Apply Datadog analytics for capacity planning, performance tuning, and cost optimization.

Integrate Datadog with security monitoring, compliance dashboards, and business KPIs.

Lead incident management using real‑time data to reduce MTTR.

Coach teams on effective Datadog usage, establish observability standards, and act as the go‑to expert for monitoring strategy.

Define Datadog tagging standards to ensure consistent metadata, traceability, and cost allocation.

Establish a framework for Datadog cost attribution, enabling transparency and accountability for monitoring expenses.

Develop a Target Operating Model for observability, including ownership guidelines and a “who to contact” matrix.

Create a structured logging strategy that identifies valuable logs, reduces noise, and ensures compliance with data privacy.

Design a proactive alerting strategy to minimize end‑user incidents, reduce false positives, and prioritize actionable alerts.

Set appropriate service and error thresholds for SLOs/SLAs and monitors, clearly defining failure criteria to align with business expectations.

Basic Qualifications:

We are looking for professionals with these required skills to achieve our goals:

Minimum of a Bachelor’s degree in Computer Science, Engineering, or a related field

8+ years in Site Reliability Engineering, DevOps, or Infrastructure roles, with at least 3 years in a leadership capacity

Deep hands-on experience with Datadog for observability, monitoring, alerting, and performance optimization

Extensive knowledge of cloud platforms (AWS, Azure, or GCP) and container orchestration (Kubernetes, Docker)

Proficiency in…

