Lead Infrastructure Engineer - Infrastructure Monitoring
Listed on 2026-04-27
IT/Tech
Systems Engineer, Cybersecurity, IT Support
We have an exciting opportunity for you to collaborate with passionate professionals, solve complex problems, and grow your career in a supportive, innovative environment.
As a Lead Infrastructure Engineer at JPMorgan Chase within Corporate Technology's Enterprise Observability Platforms, you will help build and operate a strategic, market-leading Infrastructure Monitoring platform that strengthens critical service resilience and delivers trusted operational insights. You will be a hands-on technical contributor on a high-performing agile team, building secure, stable, and scalable observability solutions: turning telemetry into actionable insights, modernizing event-to-incident workflows, and enabling automation and AIOps-driven reliability improvements aligned to the firm's business objectives.
Job responsibilities
Engineer, operate, and continuously improve the firm's Infrastructure Monitoring platforms, ensuring availability, performance, scalability, and security.
Build and run enterprise-grade Infrastructure Monitoring capabilities across Linux, Windows, and complex Network estates, including platform-level onboarding and lifecycle management.
Design and implement platform services, integrations, and telemetry collection across metrics, logs, and events, including OpenTelemetry collection patterns where applicable.
Develop and maintain standardized onboarding patterns (agents/collectors, configurations, dashboards, alert policies) to accelerate safe adoption at scale.
Improve monitoring signal quality and usability through baselining, threshold strategy, noise reduction, enrichment, and topology/context alignment.
Develop secure, high-quality automation and production code; review, debug, and improve code/configuration written by others.
Automate platform operations and reduce toil through scripting and CI/CD-driven configuration management; implement infrastructure-as-code deployment patterns.
Manage and maintain production health for the monitoring platform: lead triage, perform RCA, and deliver preventative engineering and resilience improvements.
Partner with infrastructure, application, and SRE teams to align platform capabilities to SLIs/SLOs, operational readiness, and continuous improvement goals.
Contribute to a culture of diversity, opportunity, inclusion, and respect.
Required qualifications, capabilities, and skills
Formal training or certification on infrastructure engineering concepts and 5+ years of applied experience.
Proficiency with enterprise operating systems (Linux and/or Windows), including administration, troubleshooting, performance analysis, and operational best practices within regulated production environments.
Proven hands-on experience delivering and operating enterprise-scale Infrastructure Monitoring solutions across Linux, Windows, and/or Network estates.
Solid understanding and hands-on implementation of observability and telemetry concepts, including metrics, logs, and events, with experience using OpenTelemetry collection patterns and integrating telemetry into downstream components.
Proficiency in automation and engineering practices, including scripting and development with Python, Ansible, and PowerShell/Bash, and applying CI/CD-driven workflows for controlled, secure, and repeatable change management.
Well-rounded experience in infrastructure across hardware platforms, operating systems, networking, storage, and databases (MS SQL Server, Oracle, Cassandra), including common deployment patterns, integration architectures, scaling and resiliency considerations, and performance assessment.
Experience implementing Infrastructure-as-Code (IaC) and configuration management practices using tools such as Terraform, enabling standardized provisioning and scalable, repeatable deployments.
Hands-on experience operating in hybrid infrastructure environments, including enterprise on-prem platforms and public/private cloud, with familiarity supporting and migrating monitoring capabilities across cloud boundaries.
Demonstrated ability to improve monitoring signal quality through baselining, threshold strategy, noise reduction, enrichment, and topology/context alignment, supporting reliable…