*CONTRACT TO PERM* - Senior Site Reliability Engineer with Linux and Python experience to improve and optimize batch jobs

Job in Toronto, Ontario, M5A, Canada
Listing for: S.i. Systems
Full Time, Part Time, Contract position
Listed on 2025-12-28
Job specializations:
  • IT/Tech
    Cloud Computing, Systems Engineer, IT Support, SRE/Site Reliability
Job Description & How to Apply Below
Position: *CONTRACT TO PERM* - Senior Site Reliability Engineer with Linux and Python experience to improve and optimize the batch jobs of applications

Location Address:
Hybrid - 44 King – 3 days/week onsite (days will vary depending on team)

Subject to change: 3–4 days onsite may be required based on business needs

Contract Duration: 6 months (Must convert to perm after 6 months)

Schedule Hours:

9am-5pm Monday-Friday; standard 37.5 hrs/week

Story Behind the Need

  • Business group:
    Global Banking and Markets Engineering (GBME) is the fast-moving, award-winning technology engine that powers Scotiabank’s Corporate, Investment Banking and Capital Markets businesses. The team works with all GBME applications to ensure they are reliable.
  • Project: GBME is searching for SREs who are continuous learners and are eager to boost the capabilities of capital markets products and analytics platforms. The work focuses on improving and optimizing the batch jobs of applications.
  • The resource will be aligned to an application portfolio in GBME and will ensure its batches are optimized and run resiliently, as measured by SLA adherence for batch jobs.
Typical Day in Role:

  • Reliability & Performance:
    Ensure the stability of batch processing pipelines and optimize them; reduce runtime and failure rates, and engineer for resiliency.
  • Observability:
    Implement and maintain monitoring with Dynatrace; create dashboards, alerts, and runbooks.
  • Systems Engineering:
    Manage and tune Linux and Windows systems for performance and resilience.
  • Automation & Orchestration:
    Create, modify, and optimize Airflow DAGs; build CI/CD pipelines for automation.
  • Incident Management:
    Lead incident response, root cause analysis, and postmortems; enforce SLOs and reliability practices.
  • Security & Compliance:
    Apply security best practices and ensure regulatory compliance in systems and automation.
Must-Have Skills:

    1) 10+ years of relevant working experience

    2) 7+ years’ Linux systems expertise: kernel/OS tuning, networking, filesystem optimization, process management, and troubleshooting.

    3) 5+ years’ experience with application performance monitoring

    4) 7+ years’ experience with modern development languages (Python required; Java and others an asset)

    5) 3+ years’ Airflow expertise: DAG design best practices, SLA management, scheduler/executor tuning, and scaling strategies.

    6) Proven experience optimizing batch workloads for performance, reliability, and cost. Strong understanding of distributed systems concepts: retries, idempotency, backpressure, and data integrity. Strong understanding of backend systems and batch optimization.

    7) Proven experience with containers and orchestration (Docker, Kubernetes).

    8) Excellent incident management and root cause analysis skills.

    Nice-to-Have Skills:

    1) Dynatrace mastery: custom dashboards, KPIs, anomaly detection, tagging strategy, and alerting configuration.

    2) Proficiency with CI/CD pipelines (GitHub Actions, Azure DevOps, Jenkins) and Infrastructure as Code (Terraform, Ansible).

    3) Some experience with automated deployment.

    4) Understanding of networking protocols and security principles

    5) Capital Markets product knowledge

    6) GCP Cloud experience

    7) Experience working with real-time, high-availability, low-latency systems

    Education:

    Bachelor’s degree in Computer Science, Engineering, or a related field.

    Cloud certifications an asset

    IaC automation certifications an asset

    Best vs. Average Candidate:

    The ideal candidate is passionate about Site Reliability Engineering (SRE), with a strong focus on building reusable, efficient, and scalable environments. They thrive in an innovative, cross-functional team setting and bring a strong technical and engineering mindset to the role.

    Key attributes of the successful candidate include:

    Extensive batch processing experience and a hands-on approach to problem-solving.

    Proficiency in programming, deep Linux system expertise, and solid application monitoring experience.

    Ideally, a developer who has transitioned into an SRE role, combining development skills with reliability engineering practices.

    Familiarity with typical SRE/DevOps tools is helpful but less critical for this position.

    Candidate Review & Selection – Interview Process

    2 rounds – 1 hour – in person at 44 King

    1st with HM

    2nd with GBME

    Position Requirements
    10+ years of work experience