Site Reliability Engineer
Key Responsibilities
- Incident Management and Reliability: Lead the incident management process, ensuring high availability and performance of applications. Develop and implement SRE practices to improve system reliability and resilience.
- Monitoring and Observability: Utilize Dynatrace, Splunk, and Grafana to monitor system health, detect anomalies, and provide actionable insights for performance optimization.
- Root Cause Analysis: Conduct thorough root cause analysis of incidents and outages, developing long-term solutions to prevent recurrence.
- DevOps Practices: Collaborate with development and operations teams to streamline CI/CD pipelines, automate workflows, and implement infrastructure as code (IaC) for efficient service deployment and management.
- Networking Expertise: Provide expertise in networking technologies (Cisco, Arista, AVI, etc.), ensuring robust network infrastructure design, implementation, and troubleshooting. Utilize tools like Wireshark for in-depth network analysis and debugging.
- Collaboration and Leadership: Work closely with cross-functional teams to share knowledge, mentor junior engineers, and lead by example in adopting best practices in SRE, DevOps, and networking.
- Innovation and Continuous Improvement: Stay abreast of industry trends and new technologies, advocating for and implementing innovative solutions to enhance system reliability and performance.
Qualifications
- Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field.
- 10+ years of experience in an SRE/DevOps role, with a proven track record of managing high-availability systems.
- Solid expertise in monitoring and observability tools (Dynatrace, Splunk, Grafana).
- Proficient in network debugging and analysis tools, including Wireshark.
- Solid understanding of on-prem and hybrid cloud infrastructure (VMware, Linux, Windows, Azure) and container orchestration (Kubernetes, Docker).
- Certifications in relevant technologies (Dynatrace, Splunk) are a plus.
- Excellent communication and leadership skills, capable of leading incident response initiatives and collaborating effectively across teams.
- Excellent problem-solving skills, with the ability to conduct comprehensive root cause analysis and troubleshooting.