Production Support Manager
Listed on 2026-06-29
IT/Tech
SRE/Site Reliability, IT Support, Cloud Computing: Infrastructure & Operations
Location:
Hybrid - 3 days per week
As the Production Support Manager within IL Technology Operations at New York Life Insurance Company, you will lead a small, high-impact team responsible for keeping production systems healthy, observable, and resilient. This is a proactive, prevention-focused leadership role — your primary mission is to build monitoring frameworks, automate early warning systems, and implement preventive strategies that stop production issues before they impact the business.
You will balance day-to-day operational excellence with forward-looking strategic initiatives, including platform modernization and the adoption of new technologies such as Amazon QuickSight.
- Build and continuously evolve a proactive monitoring and alerting framework — designing early warning systems, automated health checks, and trend-based detection that identify and resolve potential production risks before they escalate into system downtime or customer-facing incidents.
- Lead and develop a high-performing onshore production support team (including a Production Support Analyst and Lead), while coordinating offshore resources to ensure 24/7 coverage, consistent quality standards, and a culture rooted in prevention-first thinking.
- Drive platform modernization initiatives, including the strategic decommissioning of legacy Microsoft Access databases and implementation of CI/CD pipelines, infrastructure-as-code, and automated deployment processes that reduce manual toil and production risk.
- Develop and communicate operational health metrics, SLA dashboards, and incident trend reports to business stakeholders — delivering transparent, timely insights into production health, preventive actions taken, and continuous improvement outcomes.
- Serve as the escalation point and strategic owner for critical production incidents — rapidly triaging issues, coordinating resolution across teams, and conducting post-incident reviews that drive systemic, preventive improvements.
You bring a proactive, prevention-first mindset: you build systems that prevent fires, not just fight them. You’ve led production support or SRE teams and know what it takes to keep complex environments stable and observable. Your comfort with AWS services, DevOps practices, and automation tooling means you can credibly guide your team through both tactical firefighting and long-term resilience engineering.
You approach legacy modernization with pragmatism: you understand the risks, build the roadmap, and manage the transition without disrupting the business.
As a people leader, you invest genuinely in your team’s growth — conducting meaningful 1:1s, creating development plans, and advocating for those you lead. You navigate difficult conversations with confidence and drive accountability and ownership at every level. Your stakeholder communication is equally strong: you build trust through transparency, delivering clear metrics and dashboards that keep business partners informed on operational health and the preventive steps being taken to protect production stability.
You thrive in fast-paced, evolving environments and bring curiosity, urgency, and continuous improvement to everything you do.
- 5+ years of experience in production support, site reliability engineering (SRE), or technology operations, with a demonstrated focus on proactive monitoring and incident prevention.
- 2+ years of people leadership experience with a proven ability to develop team members, drive accountability, and lead through both strategic initiatives and high-urgency production situations.
- Working knowledge of AWS cloud services (EC2, CloudWatch, Lambda, S3, RDS, etc.) and hands-on experience designing or implementing monitoring and alerting frameworks that enable proactive detection and prevention of production issues.
- Experience with DevOps practices including CI/CD pipelines, infrastructure-as-code, and automated deployment processes that reduce manual effort and production risk.
- Proven ability to develop and execute operational strategies, establish SLAs, and communicate meaningful metrics and system health dashboards to…