Lead Cloud Reliability Engineer Job, Calgary area, Alberta, Canada, IT/Tech

Lead Cloud Reliability Engineer

Job Description

Big Geo is the Spatial Cloud.

We help companies manage and access the world’s spatial data.

Any size, any slice, any insight.

Delivered in seconds.

We’re building something that hasn’t existed before: a new layer of the internet where the “where” and “when” behind every decision is instantly clear, programmable, and actionable. Our platform removes the complexity that has kept spatial data locked in silos for decades, and replaces it with speed, precision, and control.

We’re a Calgary-based company, early and moving fast, with real customers, real infrastructure, and a clear point of view on where the world is going.

Why Big Geo Exists and Why People Build Here

Most companies are spatially blind. They know what their data says, but not where or when things actually happen. That gap costs real money, creates real risk, and limits what AI can actually do in the physical world.

Big Geo exists to close that gap.

We’re not building another tool. We’re building the rails that connect the planet’s moving data to the systems that run the world. That’s a big problem, and it takes people who care about doing things right, not just fast.

People build here because:

The problem is real and the category is open. We’re not competing for the middle of an existing market. We’re defining a new one. Your work shapes what the category becomes.
Your fingerprints are on the architecture. We’re at the stage where the decisions you make today become the foundation tomorrow. What you ship matters.
We run on clarity, not politics. We move with purpose. No bureaucratic drag, no HiPPO decisions, just a team that agrees on the mission and gets to work.
You’ll grow fast because the problems are hard. Spatial data at scale is a genuinely difficult domain. If you want to be stretched, you’ll be stretched.
We’re building for longevity. We’re not chasing hype cycles. We’re building infrastructure, the kind that compounds in value over time and earns the trust of the companies that depend on it.

The Role

Big Geo is looking for a Lead Cloud Reliability Engineer to design and operate the systems that keep the Spatial Cloud running reliably. This role sits at the intersection of hands‑on infrastructure engineering and technical leadership, and it carries real ownership over how dependable our platform feels to the customers, systems, and AI agents that run on top of it.

You’ll be responsible for the reliability architecture that supports spatial compute, data pipelines, and platform services across the Spatial Cloud. Working side‑by‑side with platform engineers, data engineers, and spatial compute teams, you’ll make sure the systems we ship are observable, resilient, and ready to handle large‑scale spatial workloads in production.

This is also a leadership seat. You’ll help set the reliability practices, operational standards, and automation systems that keep the platform stable as it scales across industries and global datasets. If you want to shape how a category‑defining infrastructure company runs in production, this is the role.

Key Responsibilities

Reliability Architecture

Design reliability patterns for distributed services across the Spatial Cloud, including failure isolation, graceful degradation, and multi‑region resilience.
Ensure systems are fault‑tolerant, production‑ready, and capable of meeting well‑defined SLOs and error budgets.
Guide architectural decisions that materially improve platform stability, throughput, and predictability under load.

Observability and Monitoring

Build and maintain monitoring, logging, and tracing systems that give every engineer clear visibility into system health, latency, and saturation.
Define and maintain meaningful SLIs, SLOs, and alert thresholds that catch real problems without creating noise.
Create dashboards, runbooks, and alerting systems that turn raw telemetry into operational awareness the whole team can act on.

Incident Response and Recovery

Lead investigation and resolution of reliability incidents, including high‑severity production events.
Improve detection, escalation, and recovery processes so service disruptions are shorter,…