*CONTRACT TO PERM* - Senior Site Reliability Engineer (Linux and Python) to Improve and Optimize Batch Jobs - Toronto area, Ontario, Canada - IT/Tech

Position: *CONTRACT TO PERM* - Senior Site Reliability Engineer with Linux and Python experience to improve and optimize the batch jobs of applications

Location Address:
Hybrid - 44 King – 3 days/week onsite (days will vary depending on team)

Subject to change: 3–4 days onsite may be required based on business needs

Contract Duration: 6 months (Must convert to perm after 6 months)

Schedule

Hours:

9am-5pm Monday-Friday; standard 37.5 hrs/week

Story Behind the Need

Business group:
Global Banking and Markets Engineering (GBME) is the fast-moving, award-winning technology engine that powers Scotiabank’s Corporate, Investment Banking and Capital Markets businesses. The team works with all GBME applications to ensure they are reliable.

Project: GBME is searching for SREs who are continuous learners and are eager to boost the capabilities of capital markets products and analytics platforms. The work focuses on improving and optimizing the batch jobs of applications.

The resource will be aligned to an application portfolio in GBME and will ensure its batches are optimized and running in a resilient way, as measured by SLA adherence for batch jobs.

Typical Day in Role:

Reliability & Performance:
Ensure stability and optimize batch processing pipelines; reduce runtime and failure rates, engineering for resiliency.

Observability:
Implement and maintain monitoring with Dynatrace; create dashboards, alerts, and runbooks.

Systems Engineering:
Manage and tune Linux and Windows systems for performance and resilience.

Automation & Orchestration:
Create, modify, and optimize Airflow DAGs; build CI/CD pipelines for automation.

Incident Management:
Lead incident response, root cause analysis, and postmortems; enforce SLOs and reliability practices.

Security & Compliance:
Apply security best practices and ensure regulatory compliance in systems and automation.

Must Have

Skills:

1) 10+ years of relevant working experience

2) 7+ years’ Linux Systems Expertise:
Kernel/OS tuning, networking, filesystem optimization, process management, and troubleshooting.

3) 5+ years’ experience with application performance monitoring

4) 7+ years’ experience with modern development languages (Python required; Java and others an asset)

5) 3+ years’ Airflow Expertise: DAG design best practices, SLA management, scheduler/executor tuning, and scaling strategies.

6) Proven experience optimizing batch workloads for performance, reliability, and cost. Strong understanding of distributed systems concepts: retries, idempotency, backpressure, and data integrity. Strong understanding of backend systems.

7) Proven experience with containers and orchestration (Docker, Kubernetes).

8) Excellent incident management and root cause analysis skills.

Nice-To-Have

Skills:

1) Dynatrace Mastery: Custom dashboards, KPIs, anomaly detection, tagging strategy, and alerting configuration.

2) Proficiency with CI/CD pipelines (GitHub Actions, Azure DevOps, Jenkins) and Infrastructure as Code (Terraform, Ansible).

3) Experience with some automated deployment.

4) Understanding of networking protocols and security principles

5) Capital Markets product knowledge

6) GCP Cloud experience

7) Experience working with real-time, high availability and low latency systems

Education:

Bachelor’s degree in Computer Science, Engineering, or a related field.

Cloud certifications an asset

IaC automation certifications an asset

Best vs. Average Candidate:

The ideal candidate is passionate about Site Reliability Engineering (SRE), with a strong focus on building reusable, efficient, and scalable environments. They thrive in an innovative, cross-functional team setting and bring a strong technical and engineering mindset to the role.

Key attributes of the successful candidate include:

Extensive batch processing experience and a hands-on approach to problem-solving.

Proficiency in programming, deep Linux system expertise, and solid application monitoring experience.

Ideally, a developer who has transitioned into an SRE role, combining development skills with reliability engineering practices.

Familiarity with typical SRE/DevOps tools is helpful but less critical for this position.

Candidate Review & Selection – Interview Process

2 rounds – 1 hour – in person at 44 King

1st with HM

2nd with GBME

