Lead SRE/DevOps Engineer
Listed on 2025-12-30
IT/Tech
Cloud Computing, Systems Engineer, IT Support, SRE/Site Reliability
We are
At Synechron, we believe in the power of digital to transform businesses for the better. Our global consulting firm combines creativity and innovative technology to deliver industry-leading digital solutions. Synechron’s progressive technologies and optimization strategies span end-to-end Artificial Intelligence, Consulting, Digital, Cloud & DevOps, Data, and Software Engineering, servicing an array of noteworthy financial services and technology firms. Through research and development initiatives in our FinLabs, we develop solutions for modernization, from Artificial Intelligence and Blockchain to Data Science models, Digital Underwriting, mobile‑first applications and more.
Over the last 20+ years, our company has been honored with multiple employer awards, recognizing our commitment to our talented teams. With top clients to boast about, Synechron has a global workforce of 14,500+ and 58 offices in 21 countries within key global markets.
The base salary for this position will vary based on geography and other factors. In accordance with law, the base salary for this role, if filled in Pittsburgh, PA or Dallas, TX, is $125k - $135k per year plus benefits (see below).
The Role
Responsibilities:
Observability & Monitoring
- Implement and enhance proactive observability frameworks to anticipate and mitigate issues before they occur.
- Optimize experience monitoring and user interaction metrics across applications and services.
- Manage and improve the event catalog, ensuring all system events are structured and actionable.
- Build and maintain dashboards, alerts, and health reporting using tools like Dynatrace, BigPanda, Mon Pro, and LogScale.
- Perform service tuning to improve system performance based on real‑time metrics and data analysis.
- Establish and maintain observability standards and best practices across teams.
- Conduct chaos testing and resilience validation to ensure high system availability.
- Lead anomaly detection practices to quickly identify and respond to unusual system behavior.
- Ensure platform stability, performance, and reliability through proven reliability engineering principles.
- Drive SRE initiatives, including continuous improvement projects within the Site Reliability Center.
- Develop, maintain, and scale automated orchestration pipelines to streamline operations and improve efficiency.
- Create, maintain, and enforce SRE standards, including SLIs, SLOs, and operational playbooks.
- Lead and conduct root cause analysis for critical incidents and drive long‑term remediation improvements.
- Own the problem management lifecycle—identifying, tracking, and resolving underlying issues to prevent recurring incidents.
- Collaborate with cross‑functional teams to address systemic issues and drive operational resilience.
Requirements:
- 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles.
- Hands‑on expertise with observability/monitoring tools such as:
  - Dynatrace (APM, RUM, dashboards, alerting)
  - BigPanda (event correlation, incident response)
  - LogScale / Mon Pro / LogicMonitor or similar log and metrics platforms
- Solid experience with cloud platforms (AWS, Azure, or GCP).
- Strong proficiency in automation & orchestration (Terraform, Ansible, Jenkins, GitHub Actions, etc.).
- Proven track record in incident management, RCA, and implementing reliable SRE practices.
- Experience with CI/CD pipelines, infrastructure as code, and configuration management.
- Deep understanding of Linux systems, networking fundamentals, and distributed system design.
- Strong scripting abilities (Python, Bash, PowerShell, or equivalent).
- Excellent communication, leadership, and cross‑team collaboration skills.
Preferred qualifications:
- Experience leading SRE or DevOps teams.
- Knowledge of chaos engineering, advanced anomaly detection, and proactive alerting strategies.
- Experience implementing SLI/SLO frameworks and performance optimization programs.
- Familiarity with containerization (Docker, Kubernetes) and service meshes.
We offer:
- A highly competitive compensation and benefits package.
- A multinational organization with 58 offices in 21 countries and the…