Senior Manager, Site Reliability Engineering Job, Dallas area, Texas, USA, IT/Tech

Overview

Senior Manager, Site Reliability Engineering

The Senior Manager, Site Reliability Engineering is responsible for overseeing the daily operations and delivery of the Site Reliability Engineering teams. This role plays a key part in driving team productivity and ensuring the ongoing health, performance, resilience, and stability of Catalyst’s eCommerce and CRM platforms. In addition to managing operational aspects, the SRE Senior Manager actively contributes to the technical direction of the team.

This includes shaping the automation strategy, guiding telemetry and observability practices, leading solution delivery, and managing incidents and problems affecting platform reliability. This is a hybrid leadership role that combines technical expertise with people management. The SRE Senior Manager also contributes to both short- and long-term planning initiatives spanning systems architecture, team development, and organizational strategy.

What You Will Do

Provide both technical and people leadership to Site Reliability Engineering (SRE) teams through regular one-on-one meetings, team syncs, and performance reviews.
Manage project execution by organizing cross-functional teams, assigning responsibilities, and tracking progress against defined schedules and milestones.
Assist in budgeting, workforce planning, hiring, and third-party contract negotiations to support team growth and operational goals.
Drive continuous improvements in platform reliability, stability, and performance by overseeing the deployment of fully automated telemetry, observability, and AI-driven monitoring solutions.
Lead the development and enhancement of intelligent alerting and automated incident response systems to improve service restoration speed and issue detection.
Collaborate with administrators and platform engineers on implementation decisions to ensure highly reliable infrastructure, systems, and integrations.
Document all changes in accordance with change control policies and documentation standards; identify risks and recommend corrective actions when necessary.
Provide advanced Incident Management and Problem Management support by analyzing telemetry data and system logs to identify, remediate, and prevent reliability issues.
Participate in on-call escalation support rotations in alignment with the 24/7/365 support model.
Act as the Escalation Manager/Critical Incident Manager during major incidents, guiding teams through structured and effective service recovery.
Communicate timely updates and incident reports to senior leadership during and after critical events.
Lead conversations and provide business and engineering support for both internal stakeholders and external customers.

What You Will Need

Experience & Leadership

10+ years of experience in global organizations, with a proven ability to communicate effectively across all levels—from executives to individual contributors.
5+ years of hands-on Site Reliability Engineering (SRE) experience, including platform automation, telemetry, observability, and self-healing systems.
Demonstrated leadership and collaboration in high-availability, mission-critical digital environments.
Strong support knowledge and understanding of retail eCommerce flows across Web and Mobile technologies.
Experience working with software engineers across scrum teams and performance engineering to ensure systems meet reliability and performance standards.
Hands-on experience debugging and optimizing code and automation.
Ability to identify opportunities to adopt innovative technologies and drive continuous improvement (automation, shift-left, self-healing).

Platform & Application Support

Extensive experience supporting and administering digital retail and eCommerce platforms on one of the major cloud providers (AWS, Azure, or Google Cloud).
Demonstrated experience in application design, software development, testing, and production support of Java/J2EE-based eCommerce applications.
Practical experience monitoring and maintaining streaming platform technologies such as Apache Kafka.
Deep understanding of cloud-native architectures and platform operations.

Monitoring, Telemetry & Observability

Proficient with modern monitoring, logging, and telemetry…