Sr Lead Infrastructure Engineer - Infrastructure Monitoring
Listed on 2026-06-16
IT/Tech
SRE/Site Reliability, IT Infrastructure, Systems Engineer
We have an opportunity to impact your career and provide an adventure where you can push the limits of what's possible.
As a Sr Lead Infrastructure Engineer - Infrastructure Monitoring at JPMorgan Chase within the Corporate Technology Enterprise Observability Platforms team, you will lead the modernization of infrastructure monitoring into a strategic, secure, scalable, and automation‑enabled observability platform, strengthening firmwide resilience and delivering trusted operational insights.
You will be a hands‑on technical contributor who drives adoption and partners across infrastructure, application, and SRE teams to improve telemetry collection and signal quality, modernize event‑to‑incident workflows, and enable AIOps‑driven reliability improvements aligned to business objectives.
Job responsibilities
Lead the modernization of the infrastructure monitoring platform, defining target‑state architecture and roadmap while balancing near‑term delivery with long‑term resiliency, scalability, security, and usability goals
Engineer, operate, and continuously improve enterprise monitoring platforms to meet availability, performance, scale, and security requirements
Own platform design and architecture for telemetry collection and integration across metrics, logs, events, and traces, including OpenTelemetry patterns where applicable
Drive large‑scale enterprise onboarding across Linux, Windows, and complex network estates, including lifecycle management, versioning/upgrade strategies, and governance controls
Standardize onboarding patterns (agents/collectors, configuration baselines, dashboards, alerting, metadata, and runbooks) to enable safe, repeatable adoption
Improve signal quality and actionability through baselining, threshold strategy, noise reduction, enrichment, and topology/context alignment to reduce MTTR and operational overhead
Develop and maintain production‑grade automation, services, and configuration‑as‑code; establish engineering standards and conduct rigorous reviews for reliability, security, and maintainability
Reduce operational toil through automation and CI/CD‑driven configuration management, including infrastructure‑as‑code patterns (e.g., Terraform)
Lead production health and operational excellence for the monitoring platform, including incident triage, root‑cause analysis, and corrective/preventative actions
Partner with infrastructure, application, and SRE teams to align platform capabilities to SLIs/SLOs, operational readiness, and continuous improvement objectives
Advance AIOps capabilities (e.g., correlation, anomaly detection, guided remediation) through experimentation, proofs of concept, and governed rollouts, while mentoring junior engineers and fostering a strong engineering culture
Formal training or certification on infrastructure engineering concepts and 5+ years applied experience
Demonstrated experience owning/operating enterprise‑scale monitoring/observability platforms in production, and designing & delivering monitoring solutions across large Linux and Windows estates.
Strong expertise with enterprise‑grade operating systems (Windows Server and/or Enterprise Linux), including secure configuration, patching, and vulnerability remediation in regulated environments.
Strong understanding of telemetry concepts (metrics, logs, traces, events) and practical OpenTelemetry collection and integration patterns.
Strong infrastructure knowledge across compute, networking, storage, databases, integration patterns, scaling, resiliency, and performance.
Advanced proficiency in automation and scripting (Python, Ansible, PowerShell, Bash) with strong use of CI/CD for controlled change and safe rollout.
Hands‑on experience with infrastructure‑as‑code for repeatable, governed provisioning and deployments (e.g., Terraform).
Extensive experience operating in hybrid infrastructure environments, spanning enterprise on‑prem platforms and public/private cloud, including migration enablement and cloud operational patterns.
Hands‑on experience with data stores such as MS SQL Server, Oracle, and Cassandra, and/or cloud‑native databases.
Strong…