AI Systems Reliability Engineer Position
Job in Toronto, Ontario, C6A, Canada
Listed on 2026-06-04
Listing for: Tenstorrent
Full Time position
Job specializations:
- IT/Tech
  Systems Engineer, SRE/Site Reliability, Cloud Computing, Network Engineer
Job Description:
In this role, you will work at the intersection of reliability and customer engineering, validating that our AI systems are production-ready. You will work with internal teams to resolve complex issues and improve monitoring and automation, contributing directly to system performance and reliability.
Key Responsibilities:
• Maintain operational integrity of AI infrastructures
• Troubleshoot issues spanning compute, network, and software
• Collaborate with teams for incident response
• Enhance monitoring and observability frameworks
• Create automation solutions to boost reliability
Requirements:
• Expertise in site reliability or systems engineering
• Advanced Linux troubleshooting capabilities
• Knowledge of observability tools like Prometheus
• Proficiency in scripting with Python or Go
• Solid grasp of networking principles at scale
In this role, you will strengthen AI infrastructure, keeping systems robust and reliable through efficient operational practices.