Senior Associate – DR Recovery Lead; IT Operations
Listed on 2025-12-27
IT/Tech
Systems Engineer, Cybersecurity, Cloud Computing, IT Project Manager
Location: New York
Location Designation: Hybrid – 3 days per quarter
As part of Technology, you’ll have the opportunity to contribute to groundbreaking initiatives that shape New York Life’s digital landscape. You’ll leverage cutting‑edge technologies such as Generative AI to increase productivity, streamline processes, and create seamless experiences for clients, agents, and employees. Your expertise fuels innovation, agility, and growth, driving the company’s success.
Role Summary
New York Life is standing up a repeatable, automation‑first Disaster Recovery (DR) operating model to ensure we can sustain a Minimum Viable Company (MVC) and recover priority services within 48 hours. As the DR Recovery Lead (IT Ops), you will be the single‑threaded owner of day‑to‑day DR operations: driving orchestration execution, maintaining infrastructure and application runbooks, coordinating cross‑technology teams and vendors, and ensuring audit‑ready evidence for quarterly exercises and an annual recovery test calendar.
You’ll also align DR with enterprise architecture and regulatory standards and continuously improve our capabilities.
Key Responsibilities
- Own DR operations & runbooks: Build, maintain, and continuously improve infrastructure and application recovery runbooks aligned to the enterprise DR framework and RACI.
- Execute orchestrated recoveries: Lead automation‑first recovery using IaC/pipelines and an evidence harness that captures artifacts, health checks, and outcomes for audit.
- Plan & run tests: Lead quarterly tabletop/functional validations, drive an annual DR exercise calendar, and manage test evidence and acceptance with business owners.
- Safeguard environments: Monitor configuration parity and drift; ensure DR capacity/readiness across failover patterns; coordinate change windows with APSO/CAB.
- Restore securely: Coordinate the restoration of IAM and keys/certificates, and the re‑enablement of security controls, in alignment with cyber‑incident procedures.
- Recover data with integrity: Partner with DBA/Data teams on backup/restore or replication, validation, and reconciliation steps.
- Prove service health: Define and run synthetic probes/SLIs/SLOs and publish dashboards to verify recoverability.
- Manage vendors: Orchestrate third‑party SLAs, negotiate test windows, and validate contractual obligations and evidence.
- Map & prioritize services: Maintain Critical Business Service (CBS) inventories and dependencies; scale playbooks across priority CBS.
- Lead during incidents: Serve as DR operations lead during activation, coordinating communications and cross‑technology execution through recovery.
- Architectural alignment: Ensure DR strategies, patterns, and runbooks conform to enterprise architecture standards, reference architectures, and future‑state infrastructure plans; participate in design reviews and provide DR non‑functional requirements.
- Multi‑cloud & cloud‑native DR: Engineer and operate DR solutions across on‑prem and multi‑cloud environments (e.g., AWS/Azure), leveraging cloud‑native patterns such as active/active, regional failover, immutable infrastructure, and serverless recovery.
- Regulatory & compliance: Embed controls and evidence to meet NYDFS, SOX, GDPR, and related obligations; align to NIST (e.g., SP 800‑34/61) and ISO 22301 principles; maintain audit‑ready artifacts and traceability.
- Continuous improvement & innovation: Drive quarterly improvement backlogs; pilot emerging techniques (e.g., chaos engineering/game days, AI‑assisted recovery validation), retire manual steps, and report ROI.
Qualifications
- 8 years of experience in IT Operations / SRE / DR or equivalent enterprise resiliency roles.
- Hands‑on experience with DR patterns (active/active, active/passive), backup/restore & replication, and hybrid/multi‑cloud infrastructure.
- Strong automation/IaC background (e.g., Terraform/CloudFormation), CI/CD pipelines, and scripting (PowerShell, Bash, or Python).
- Proven test planning & execution (tabletops through functional validation) with rigorous evidence capture.
- Familiarity with security control restoration (IAM, PKI, secrets) and alignment to cyber‑incident runbooks.
- Observability expertise (health checks, synthetic probes, SLIs/SLOs, dashboards).
- Effective vendor management, change/incident coordination…