Site Reliability Engineer (SRE)
Job in Toronto, Ontario, M5A, Canada
Listing for: Tangerine
Full Time position
Listed on 2026-02-09
Job specializations:
- IT/Tech: Cloud Computing, IT Support, Systems Engineer, SRE/Site Reliability
Job Description & How to Apply Below
Position: Site Reliability Engineer (SRE)
Is this role right for you?
In this role you will:
- Manage the team workflow to maximize business and technical efficiencies. Develop and guide team members in enhancing their technical capabilities and increasing productivity.
- Supervise the IT Support team: assign and prioritize production incidents and problems, train and coach IT Support teams on ways to improve customer support, and develop staff skills.
- Ensure that all production issues are resolved within SLAs, that user requests are completed satisfactorily, and that all customer requests are responded to in a timely manner.
- Provide investigation and second-level support on client issues, technical issues, system/website outages, and questions from all internal and external applications by maintaining, prioritizing, and addressing them to the respective Tangerine technology groups and vendors.
- Run the production environment by monitoring availability and taking a holistic view of system health.
- Improve the reliability, quality, and time-to-market of our suite of software solutions.
- Measure and optimize system performance to push our capabilities forward, get ahead of customer needs, and innovate to improve continually.
- Maintain the production applications and day-to-day operational activities, manage escalations, and modify established procedures and approaches to suit specific situations, including 24x7 support and coordination of recovery efforts.
- Participate in defining SLIs, SLOs, and SLAs for enterprise systems.
- Gather and analyze metrics from both applications and infrastructure to assist in performance tuning and fault finding.
- Partner with development teams to address outstanding tickets and implement permanent fixes.
- Create sustainable systems and services through automation and process improvements.
- Balance feature development speed and reliability with well-defined service level objectives.
- Monitor the health of multiple applications and discover opportunities to optimize in a continuously growing, large, complex hybrid environment.
- Lead on-call problem escalation and outage recovery efforts, not limited to code fixes in the presentation and integration layers, but also providing infrastructure-level investigation and support where necessary.
- Lead post-incident technical retrospectives to discover and implement remediation actions.
- Perform troubleshooting, deploy systems, or execute maintenance tasks as necessary to meet the specified SLOs.

Do you have the skills that will enable you to succeed in this role? We'd love to work with you if you have:
- 2-4 years of experience in developing and/or supporting complex, large-scale customer-facing platforms.
- Good understanding of multi-tier applications and microservices (Docker, Kubernetes, etc.).
- Experience instrumenting and monitoring cloud-hosted software stacks (preferably GCP, Vertex AI, GCE, Network, BigQuery, Cloud SQL).
- Good understanding of networking concepts: TCP/IP, DNS, HTTP, TLS, OSI model.
- Familiarity with our tech stack: Java/J2EE/Spring Boot/Python/JS/Node.js; front end: iOS and Android native apps; deployment runtime: K8s, WebSphere, WebSphere Liberty, Node.js/TS.
- Basic knowledge of one or more scripting languages (Ansible, Terraform, Bash, etc.).
- Strong working experience with incident management and setting up monitoring alerts.
- Proficient understanding of code versioning tools, such as Git/Bitbucket.
- Knowledge of building a highly automated production monitoring and support model, with hands-on experience integrating Splunk, Ansible, Dynatrace, Sumo Logic, ServiceNow, or equivalents.
- Proven ability to translate ideas into technical and business realities and map technology to business problems.
- Experience with private/public cloud services and platforms.
- Superior verbal and written communication skills, with the ability to influence decision-making with stakeholders.
- A proactive approach to spotting problems, areas for improvement, and performance bottlenecks.
- Exceptional written and verbal communication skills.
- Excellent problem-solving skills.
- Flexible approach to work and the ability to adapt to change.
- Prior production support or SRE experience.
- Proficient with MS…