Denvr is a vertically integrated AI Platform Services company headquartered in Calgary, Canada. We provide foundational compute infrastructure and services to support the broader AI ecosystem and its end users. The platform includes cloud‑native solutions for training, inference, high‑performance computing, data processing, scalable storage, and a suite of software toolsets that accelerate the development, deployment, and integration of AI applications.
These capabilities are accessible via the public Denvr AI Cloud or through Private AI Platform Services, which offer fully dedicated, sovereign environments with enhanced security. Private deployments incorporate advanced data centers, optimized compute architectures, high‑throughput storage fabrics, and tightly integrated platform operations software—engineered to meet the demands of large‑scale, mission‑critical AI workloads.
Why Join Us
Joining Denvr means being part of a world‑class team in the fast‑moving field of AI and high‑performance computing. We value curiosity, collaboration, and continuous learning. Our people are proactive problem solvers who take pride in delivering great results, thrive in open and transparent environments, and enjoy learning by doing.
About the Role
We are seeking a Site Reliability Engineer (SRE) with experience spanning cloud and data center environments to drive infrastructure reliability, observability, and scalability. In this role, you will design and operate resilient, high‑performance systems that enable cutting‑edge data solutions.
What You’ll Do
Observability & Monitoring: Design, implement, and maintain observability systems with Grafana, Prometheus, VictoriaMetrics, and PromQL to monitor system health and performance.
Industry Best Practices: Explore opportunities to improve the overall observability of HPC environments using industry best practices.
Incident Management & Troubleshooting: Participate in on‑call rotations, rapidly diagnose and resolve incidents, and perform postmortem reviews to drive continuous improvements.
DevOps & CI/CD: Build and automate DevOps pipelines using GitHub Actions (or similar tools).
Who You Are
Experience: 3‑5 years in a Site Reliability Engineering (SRE) or DevOps role.
Infrastructure as Code (IaC): Familiarity with tools such as Terraform, Helm, Ansible, and Python for automated infrastructure provisioning.
Security Best Practices: Knowledge of security practices and compliance standards for enterprise environments.
HPC Knowledge: Familiarity with high‑performance computing, specifically in administering GPU‑related workloads.
Kubernetes Proficiency: Strong experience managing Kubernetes clusters in production environments.
Observability Tools: Expertise with observability platforms (Grafana, Prometheus, PromQL) for tracking and analyzing system metrics.
Networking: Solid understanding of networking fundamentals (TCP/IP, DNS, load balancing, VPNs).
AWS Cloud/Hybrid Cloud: Hands‑on experience developing and deploying production‑grade applications on AWS within a hybrid cloud architecture.
Linux Systems: Proficiency in Linux administration, shell scripting, and performance tuning.
Programming Experience: Strong software development skills (e.g., Bash, Python, Golang) to automate infrastructure and operational tasks.
If you are passionate about technology and want to be part of a remote‑first, forward‑thinking company, Denvr would love to hear from you and learn more about your skills and capabilities. Click on the link to apply!