Software Engineering Director, Production Support Operations
Listed on 2026-05-24
IT/Tech
IT Support, Systems Engineer, IT Project Manager, SRE/Site Reliability
Director of Production Support
The Director of Production Support leads teams responsible for ensuring the stability, resilience, and operational excellence of critical technology platforms supporting core lines of business. This role owns end‑to‑end production support operations while driving maturity toward engineering‑first, site reliability–focused practices. The Director identifies and resolves complex technical, operational, risk, and organizational challenges, while building high‑performing, accountable teams across onshore and offshore locations.
This position carries full people management responsibility, including hiring, coaching, performance management, and disciplinary actions, and serves as a key partner to Technology, Risk, and Business leadership.
Own end‑to‑end production support operations for multiple mission‑critical applications supporting key lines of business, ensuring availability, stability, and performance meet defined SLAs and SLOs. Provide accountable, visible leadership for 24x7 operational support, including on‑call models, escalation paths, and incident response effectiveness. Act as the senior escalation point for major incidents, ensuring swift recovery, accurate root cause analysis, and durable remediation.
Incident & Problem Management
Lead cross‑functional incident recovery efforts in partnership with Incident Management, engineering teams, infrastructure, and business stakeholders. Ensure timely root cause analysis (RCA), post‑incident reviews, and corrective actions that prevent recurrence. Establish and mature a production knowledge base, documenting known issues, recovery procedures, and architectural insights.
Engineering‑First & SRE Practices
Drive adoption of Site Reliability Engineering (SRE) and lean engineering principles, including:
- Reduction of toil through automation
- Engineering‑based reliability metrics (error budgets, SLIs/SLOs)
- Proactive resilience and failure prevention practices
Champion automation of repetitive and manual operational tasks, including incident detection, response, validation, and recovery where feasible. Promote a culture of preventative engineering, partnering with development teams to improve system reliability upstream.
Monitoring, Observability & AI Enablement
Implement and continuously improve real‑time monitoring, alerting, and observability across applications and infrastructure. Measure and optimize the effectiveness of monitoring and alerting to eliminate noise and reduce mean time to detect and mean time to recover. Leverage AI and advanced analytics to correlate telemetry data (logs, metrics, traces) and proactively identify emerging risks and root causes. Champion the safe and responsible use of AI within production operations by adhering to enterprise guardrails and protecting sensitive data and system integrity.
Operational Readiness & Change Enablement
Oversee operational readiness across releases, disaster recovery and failover testing, and certificate and dependency lifecycle management. Ensure production support is actively embedded in change planning, minimizing risk from releases and infrastructure changes.
People, Vendor & Financial Management
Lead one or more Agile teams (Scrum, Kanban), including onshore and offshore engineers, fostering high performance and accountability. Manage workforce vendors and partners, setting expectations, reviewing performance, and ensuring delivery quality. Own budget and staffing plan aligned to application criticality, operational risk, and business growth objectives.
Risk Management & Governance
Act as the first line of defense in production operations by proactively identifying and mitigating technology, operational, and resiliency risks. Partner effectively with second‑line Risk, Audit, and Regulatory teams, ensuring findings are addressed and controls are continuously improved. Ensure compliance with internal policies, regulatory requirements, and external audit expectations. Own and drive remediation plans for risk, audit, and regulatory findings, ensuring timely, effective and sustainable…