Site Reliability Engineer Job, Ottawa area, Ontario, Canada, IT/Tech

Position: Site Reliability Engineer

Key Responsibilities

Incident Management and Reliability: Lead the incident management process, ensuring high availability and performance of the applications. Develop and implement SRE practices to improve system reliability and resilience.
Monitoring and Observability: Utilize Dynatrace, Splunk, and Grafana to monitor system health, detect anomalies, and provide actionable insights for performance optimization.
Root Cause Analysis: Conduct thorough root cause analysis of incidents and outages, developing long-term solutions to prevent recurrence.
DevOps Practices: Collaborate with development and operations teams to streamline CI/CD pipelines, automate workflows, and implement infrastructure as code (IaC) for efficient service deployment and management.
Networking Expertise: Provide expertise in networking technologies (Cisco, Arista, AVI, etc.), ensuring robust network infrastructure design, implementation, and troubleshooting. Utilize tools like Wireshark for in-depth network analysis and debugging.
Collaboration and Leadership: Work closely with cross-functional teams to share knowledge, mentor junior engineers, and lead by example in adopting best practices in SRE, DevOps, and networking.
Innovation and Continuous Improvement: Stay abreast of industry trends and new technologies, advocating for and implementing innovative solutions to enhance system reliability and performance.

Qualifications

Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field.
10+ years of experience in an SRE/DevOps role, with a proven track record in managing high-availability systems.
Solid expertise in monitoring and observability tools (Dynatrace, Splunk, Grafana).
Proficient in network debugging and analysis tools, including Wireshark.
Solid understanding of on-prem and hybrid cloud infrastructure (VMware, Linux, Windows, Azure) and container orchestration (Kubernetes, Docker).
Certifications in relevant technologies (Dynatrace, Splunk) are a plus.
Excellent communication and leadership skills, capable of leading incident response initiatives and collaborating effectively across teams.
Excellent problem-solving skills, with the ability to conduct comprehensive root cause analysis and troubleshooting.
