Head of Platform & Reliability Engineering
Listed on 2026-03-05
IT/Tech
Systems Engineer, Cloud Computing, IT Project Manager, Cybersecurity
About the Opportunity
We operate a large-scale, real-time fleet technology platform supporting enterprise clients across North America. The company manages hundreds of thousands of connected mobile assets and delivers mission-critical telemetry, safety intelligence, and operational data 24x7.
As the organization continues to scale, we are seeking a senior technical leader to take full ownership of the infrastructure, cloud architecture, reliability engineering, and internal technology operations that power the platform.
This role is not purely managerial. It requires a hands-on leader who can architect systems, improve operational maturity, and build a high-performing engineering organization capable of supporting sustained growth.
Role Overview
The Head of Platform & Reliability Engineering is accountable for the performance, availability, security, and scalability of the company’s hybrid cloud and on-premises technology stack.
This individual will lead infrastructure engineering, DevOps, SRE, and corporate IT functions while establishing modern platform standards, strengthening operational discipline, and ensuring 24x7 service continuity.
You will serve as the executive owner of uptime, resilience, and infrastructure strategy.
What You Will Own
Platform Architecture & Operations
- Lead design and operation of hybrid cloud environments (Azure + data center)
- Ensure high availability, redundancy, and performance across production systems
- Architect secure networking, identity management, storage, backup, and monitoring solutions
- Drive cloud cost governance and resource optimization initiatives
- Establish standards for logging, alerting, access control, and infrastructure consistency
Reliability & DevOps Modernization
- Implement and scale CI/CD pipelines and Infrastructure as Code practices
- Lead Kubernetes architecture and container orchestration initiatives
- Introduce SRE principles including SLOs, SLIs, error budgets, and blameless postmortems
- Improve deployment velocity while reducing operational risk
- Strengthen change management discipline without slowing innovation
Security, Risk & Disaster Recovery
- Implement access governance, vulnerability management, and monitoring controls
- Establish incident response procedures and root cause analysis standards
- Define RPO/RTO targets and execute disaster recovery strategy
- Conduct regular recovery testing and maintain documented runbooks
- Ensure encryption, secrets management, and privileged access controls meet enterprise standards
Internal Technology & IT Operations
- Oversee corporate systems including endpoints, SaaS platforms, telecom, and collaboration tools
- Establish IT service management practices (incident, problem, asset, request workflows)
- Manage vendor relationships and licensing strategy
- Improve employee experience through reliable and secure internal systems
Leadership & Organizational Development
- Build and mentor a multidisciplinary team across infrastructure, DevOps/SRE, and IT support
- Establish operational metrics and executive reporting cadence
- Lead capacity planning to support company growth
- Foster a culture of accountability, ownership, and continuous improvement
Required Background
- 10+ years in infrastructure, cloud engineering, or platform operations
- 3–5+ years leading technical teams in production environments
- Deep experience with Microsoft Azure (compute, networking, identity, monitoring, cost management)
- Proven experience operating high-availability, customer-facing platforms
- Hands-on Kubernetes and container orchestration expertise
- Strong understanding of Infrastructure as Code (Terraform, ARM, Bicep, etc.)
- Experience implementing structured change management processes
- Direct ownership of disaster recovery planning and testing
- Security controls implementation experience in enterprise environments
- Ability to communicate technical risk and tradeoffs to executive leadership
Preferred Experience
- B2B SaaS or real-time data platforms
- Telematics, IoT, fleet technology, or distributed systems
- Observability tooling (Datadog, Prometheus/Grafana, Azure Monitor, etc.)
- ITSM platforms such as Jira Service Management or ServiceNow
- Experience scaling infrastructure to support rapid growth
- Exposure to compliance frameworks (SOC 2, ISO 27001, HIPAA, etc.)
- Relevant Azure or security certifications
What Success Looks Like
- Measurable improvement in uptime and MTTR
- Increased release velocity with lower deployment risk
- Predictable cloud cost management
- Tested and validated disaster recovery posture
- A cohesive, high-performing platform engineering organization