SRE Engineer
Markham, Ontario, Canada
Listed on 2025-12-02
IT/Tech
Systems Engineer, Cloud Computing, SRE/Site Reliability, IT Support
Description
Kloia is a recognized AWS Premier Consulting Partner and CNCF member with a focus on Application Modernization and Digital Transition projects.
Our teams are growing rapidly, and we’re hiring a Site Reliability Engineer primarily for our managed services provided to customers, as well as for internal projects to build a scalable and reliable platform of common services.
What does SRE do?
At Kloia, the SRE team focuses on eliminating toil in production workloads. Our main goal is to deliver a 24x7 SLA through a follow-the-sun support system and team.
Key responsibilities include participating in design and development, making trade-offs between performance, cost, security, and reliability, and supporting the system in production as a reliable escalation point.
As an SRE, you will:
- Eliminate toil through automation, re-architecting, and refactoring.
- Approach incidents with an “Automate Everything” mindset.
- Collaborate with software engineers to troubleshoot incidents.
- Drive complex infrastructure changes with transparency and zero downtime.
- Design and implement self-healing, reliable, and scalable infrastructure in a cloud-native environment.
- Guide and unblock developers across teams to push their products forward.
- Define SLOs and error budgets for production services.
- Support our DevOps culture, including participation in the follow-the-sun on-call rota.
Position: SRE (Site Reliability Engineer)
Location: Remote - LATAM / APAC
Level: Junior/Medior
What does an average day look like?
Proactively support production workloads, troubleshoot to find root causes, and write or review postmortems. Identify infrastructure and observability weaknesses.
Technical challenges include:
- Optimizing resource allocation in Kubernetes for application performance.
- Including API Gateway monitoring in APM for full observability.
- Reducing database query hits.
- Guiding the development team on data-layer caching.
Our stack is cloud-native and includes AWS, Terraform, Docker/Kubernetes, Helm, ELK, Instana, Opsgenie, Node.js, Java, TypeScript, and Python. We expect candidates to have relevant experience and a deep understanding of Linux-based distributed systems at scale.
Who should apply?
This role suits those eager to work with cutting-edge cloud infrastructure at scale, passionate about automation, and capable of explaining complex concepts simply.
Career benefits:
Exposure to new technologies, work on products with global reach, and opportunities to build both development and operations skills. We encourage continuous learning through initiatives like hack days and training.
Requirements:
- Excellent communication skills
- Deep knowledge of Linux-based distributed systems at scale
- Experience with AWS or other cloud providers
- Experience with SQL/NoSQL databases at scale
- Experience with service lifecycle and monitoring
- Experience as a software or platform engineer / SRE
- Experience with DevOps practices
- Good understanding of Docker
- Automation mindset
Nice to have:
- Knowledge of Kubernetes
- Experience with Terraform or other Infrastructure as Code tools
Benefits include:
- Remote work flexibility
- Home office budget
- Hackathon days
- Access to AWS and CNCF/Kubernetes training and certifications
- R&D focus
- Social activities like weekly Lunch & Learn Fridays, socials, and online games