Platform Engineer (Cloud Reliability Engineer) Job, Ottawa area, Ontario, Canada, IT/Tech

Position: Platform Engineer (Cloud Reliability Engineer)

Job Title:

Platform Engineer (Cloud Reliability Engineer)

Reports to:

Director, Global Operations

Based in:

Ottawa, ON

Term:

Full Time

About Nanometrics:

With 40 years of seismic technology and industry application experience, we are a global, award-winning company providing monitoring solutions and equipment for studying artificial and natural seismicity. Our work ranges from mission-critical seismic arrays and tsunami and earthquake early warning systems in over 90 countries to induced seismicity monitoring in the energy sector. We specialize in full-service, integrated solutions, including turnkey seismic networks, industry-leading precision instrumentation, complete data processing, analysis services, and software applications.

At Nanometrics, we take pride in fostering a culture of innovation, collaboration, and excellence. We are passionate about making a global impact through cutting-edge technology while staying rooted in values of intentional innovation, trust, ethics, and stability.

About the role:

This is an exciting opportunity for a motivated and experienced Platform Engineer to evolve, enhance, and lead the technological footprint of our Seismic Monitoring Services portfolio. Nanometrics provides a top-tier portfolio of tools and services, supported by a continuously evolving cloud-based platform.

The Platform Engineer / Cloud Reliability Engineer ensures the reliability, performance, and operational excellence of cloud-hosted seismic monitoring and data processing services. This role blends software engineering, cloud infrastructure management, and SRE practices to build resilient systems, reduce manual toil through automation, and improve observability across AWS and Kubernetes ecosystems.

The successful candidate will use Terraform or similar Infrastructure-as-Code technologies (Pulumi, AWS CDK, CloudFormation, OpenTofu) to deliver consistent, automated, scalable infrastructure.

Responsibilities:

Cloud Reliability & Resilience

Ensure uptime, performance, and reliability of AWS-hosted services and Kubernetes workloads

Implement self-healing patterns, automated rollbacks, health checks, and safe-deployment strategies

Participate in on-call rotation and lead first-response triage for cloud and platform incidents

Build and maintain service-level indicators (SLIs) and service-level objectives (SLOs)

Automation & Infrastructure Engineering

Develop automation for cloud operations using Python, Bash, and IaC (Terraform)

Reduce operational toil through automated runbooks, event-driven remediation, and system orchestration

Improve deployment reliability in collaboration with Platform Engineering and R&D teams

Implement and refine configuration standards, CI/CD hygiene, and environment stability

Observability & Operational Intelligence

Maintain and extend observability stack (Prometheus, Grafana, InfluxDB, OpenTelemetry)

Tune alerts for accuracy, reduce noise, and implement actionable alerting tied to SLOs

Analyze logs, metrics, and traces to detect reliability issues and validate system behavior

Build dashboards that provide real-time visibility into system health and reliability trends

Operational Excellence

Support release processes, platform upgrades, and cloud infrastructure changes

Conduct root-cause analysis and drive post-incident corrective actions

Maintain operational documentation, runbooks, and environment validation workflows

Collaborate cross-functionally with NetOps, Platform Engineering, Field Ops, and R&D

Requirements:

Education and Experience

Bachelor's degree or higher in Software Engineering, Computer Science, or related field.

7+ years of experience in software development

3+ years of hands-on experience working with cloud providers such as AWS, cloud-native technologies such as Kubernetes and Helm, and related technologies including observability platforms.

Experience with database operations (MySQL, PostgreSQL, MongoDB, Redis) in cloud and on-prem environments.

Cloud & Infrastructure

Strong experience with AWS (EC2, S3, IAM, VPC, EKS/ECS, CloudWatch)

Solid understanding of Kubernetes, Helm charts, and container orchestration

Familiarity with hybrid cloud environments (cloud + on-prem integration)

Infrastructure as Code & Automation

Hands-on experience with Terraform

Scripting skills in Python and Bash

Ability to build automated workflows and cloud operations tooling

CI/CD & Deployment Engineering

Experience with deployment pipelines (Jenkins, Bitbucket Pipelines, ArgoCD)

Familiarity with GitOps workflows

Understanding of build systems (Maven, Gradle)

Monitoring & Observability

Experience with monitoring/metrics/logging tools such as Prometheus, Grafana, InfluxDB

Familiarity with OpenTelemetry for distributed tracing

Ability to diagnose performance issues in distributed systems

Reliability Engineering Concepts

Knowledge of SLOs/SLIs/error budgets

Incident management principles

Understanding of resilience patterns (retry, circuit breakers, autoscaling, etc.)

Why Nanometrics?

We are a…

