Senior Engineer, IT Reliability Job, New York, New York, USA, IT/Tech

Location: New York

Position Summary

The Senior Reliability Engineer (Infrastructure) is responsible for ensuring the reliability, availability, and recoverability of JetBlue's critical infrastructure platforms. This role applies engineering discipline to operational challenges, leads response to complex incidents, and drives improvements that reduce operational risk over time. The Senior Reliability Engineer works closely with cloud, platform, network, and application teams to ensure infrastructure systems are observable, resilient, and safe to operate in production, while exhibiting the JetBlue values of Safety, Caring, Integrity, Passion, and Fun.

Essential Responsibilities

Own reliability outcomes for critical infrastructure platforms supporting JetBlue production systems.
Define and manage Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for infrastructure capabilities.
Lead response, diagnosis, and resolution of complex infrastructure incidents as Incident Commander or senior technical authority.
Participate in a 24x7 on‑call rotation and help improve incident response practices.
Diagnose and mitigate failures across Linux systems, Kubernetes platforms, Azure cloud infrastructure, and networking layers.
Review and approve high‑risk infrastructure changes with consideration for blast radius, rollback readiness, and dependency impact.
Identify and mitigate capacity, scaling, and saturation risks across infrastructure systems.
Improve monitoring, alerting, and dashboards to reflect real system health and customer impact.
Reduce operational toil through automation, tooling, and reliability‑focused engineering improvements.
Develop and maintain operational documentation, runbooks, and recovery procedures.
Lead blameless post‑incident reviews and drive corrective actions to prevent repeat incidents.
Mentor engineers on operational excellence, reliability practices, and incident response.
Collaborate with cloud, platform, network, and security teams to ensure reliable and secure infrastructure operations.
Ensure infrastructure platforms meet regulatory, compliance, and security requirements as applicable.
Other duties as assigned.

Minimum Experience and Qualifications

Bachelor's Degree in Computer Science or a related discipline; OR demonstrated capability to perform job responsibilities with a combination of a High School Diploma/GED and at least four (4) years of relevant experience.
Five (5) or more years of experience in Site Reliability Engineering, infrastructure operations, DevOps, or production engineering roles.
Demonstrated experience operating and supporting large‑scale production infrastructure.
Strong Linux troubleshooting skills across CPU, memory, disk, and process behavior.
Strong understanding of networking fundamentals including TCP/IP, DNS, load balancing, and failure modes.
Hands‑on experience operating Kubernetes clusters, including troubleshooting, scaling, and failure recovery.
Experience operating infrastructure in a public cloud environment (Azure preferred).
Experience with observability tools including metrics, logs, tracing, and alerting.
Proficiency in at least one programming or scripting language (such as Python, Go, Java, or similar) used to automate operations and improve reliability.
Experience using infrastructure‑as‑code and automation to reduce operational toil.
Ability to make sound decisions under pressure and communicate clearly during incidents.
Able to work flexible hours and participate in on‑call rotations.
Available for occasional overnight travel (10%).
Must pass a pre‑employment drug test.
Must be legally eligible to work in the country in which the position is located.
Authorization to work in the US is required. This position is not eligible for visa sponsorship.

Preferred Experience and Qualifications

Seven (7) or more years of experience in Site Reliability Engineering, infrastructure operations, DevOps, or production engineering roles.
Experience defining and operationalizing SLOs and using error budgets to guide reliability decisions.
Experience with capacity planning and demand forecasting.
Experience operating highly available, distributed systems.
Experience mentoring…