SRE Engineer Job Markham Ontario Canada,IT/Tech

Description

Kloia is a recognized AWS Premier Consulting Partner and CNCF member with a focus on Application Modernization and Digital Transition projects.

Our teams are growing rapidly, and we’re hiring a Site Reliability Engineer primarily for our managed services provided to customers, as well as for internal projects to build a scalable and reliable platform of common services.

What does SRE do?

At Kloia, the SRE team focuses on eliminating toil in production workloads. Our main goal is to meet a 24x7 SLA with a ‘follow-the-sun’ support system and team.

Key responsibilities include participating in design and development, making trade-offs between performance, cost, security, and reliability, and supporting the system in production as a reliable escalation point.

As an SRE, you will:

Eliminate toil through automation, re-architecting, and refactoring.
Approach incidents with an “Automate Everything” mindset.
Collaborate with software engineers to troubleshoot incidents.
Drive complex infrastructure changes with transparency and zero downtime.
Design and implement self-healing, reliable, and scalable infrastructure in a cloud-native environment.
Guide and unblock developers across teams to push their products forward.
Define SLOs and error budgets for production services.
Support our DevOps culture, including participation in the follow-the-sun on-call rota.

Position: SRE (Site Reliability Engineer)

Location: Remote - LATAM / APAC

Level: Junior/Mid-level

What does an average day look like?

Proactively support production workloads, troubleshoot to find root causes, and write or review postmortems. Identify infrastructure and observability weaknesses.

Technical challenges include:

Optimizing resource allocation in Kubernetes for application performance.
Including API Gateway monitoring in APM for full observability.
Reducing database query hits.
Guiding the development team on data-layer caching.

Our stack is cloud-native, including AWS, Terraform, Docker/Kubernetes, Helm, ELK, Instana, Opsgenie, Node.js, Java, TypeScript, and Python. We expect candidates to have a deep understanding of Linux-based distributed systems at scale and relevant experience.

Who should apply?

This role suits those eager to work with cutting-edge cloud infrastructure at scale, passionate about automation, and capable of explaining complex concepts simply.

Career benefits:

Exposure to new technologies, work on products with global reach, and opportunities to build both development and operations skills. We encourage continuous learning with initiatives like hack days and training.

Requirements:

Excellent communication skills
Deep knowledge of Linux distributed systems at scale
Experience with AWS or other cloud providers
Experience with SQL/NoSQL databases at scale
Experience with service lifecycle and monitoring
Experience as a software or platform engineer / SRE
Experience with DevOps practices
Good understanding of Docker
Automation mindset

Nice to have:

Knowledge of Kubernetes
Experience with Terraform or other Infrastructure as Code tools

Benefits include:

Remote work flexibility
Home office budget
Hackathon days
Access to AWS and CNCF/Kubernetes training and certifications
R&D focus
Social activities like weekly Lunch & Learn, Fridays, socials, and online games
