Senior Lead Site Reliability Engineer
Job in
Glasgow, Glasgow City Area, G1, Scotland, UK
Listed on 2026-05-23
Listing for:
JPMorganChase
Full Time position
Job specializations:
- IT/Tech
Systems Engineer, Cloud Computing, SRE/Site Reliability
Job Description & How to Apply Below
Be an integral part of an agile team that's constantly pushing the envelope to enhance, build, and deliver top‑notch reliability and observability for our most critical platforms.
As a Senior Lead Site Reliability Engineer at JPMorgan Chase within the Commercial & Investment Bank, you are an integral part of an agile team that works to enhance, build, and deliver trusted market‑leading technology products in a secure, stable, and scalable way.
Job responsibilities
- Regularly provides technical guidance and direction on site reliability practices to support the business and its technical teams, contractors, and vendors
- Develops secure and high‑quality production code for reliability tooling and telemetry pipelines, and reviews and debugs code written by others
- Drives decisions that influence reliability design, observability architecture, application functionality, and technical operations and processes
- Serves as a function‑wide subject matter expert in one or more areas of site reliability, observability, or telemetry engineering
- Leads resiliency design reviews and breaks up complex reliability problems into digestible work for other engineers, acting as a technical lead for large‑sized products
- Acts as the main point of contact during major incidents, demonstrating the skills to identify and solve issues quickly to avoid financial losses, and champions blameless postmortem culture
- Collaborates with team members and stakeholders to define comprehensive service level indicators, service level objectives, and error budgets
- Designs, implements, and maintains operational reliability for large‑scale OpenTelemetry pipelines on hybrid on‑prem/cloud environments, supporting telemetry ingestion, processing, and export to backends such as InfluxDB, Prometheus, Elasticsearch, and OpenSearch
- Drives the assessment, refactoring, and incremental migration of custom legacy telemetry collection code to standardized OpenTelemetry instrumentation, reducing technical debt while maintaining system stability
- Actively contributes to the engineering community as an advocate of firm‑wide frameworks, tools, and practices, and influences peers and project decision‑makers to consider the use and application of leading‑edge observability and reliability technologies
- Adds to the team culture of diversity, opportunity, inclusion, and respect
Qualifications, capabilities, and skills
- Formal training or certification on software engineering concepts and advanced applied experience delivering system design, application development, testing, and operational stability
- Advanced knowledge of reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices, with considerable in‑depth knowledge in one or more technical disciplines (e.g., cloud, observability, distributed systems, etc.)
- Advanced proficiency in one or more programming languages (e.g., Java, Python, Go, etc.)
- Advanced proficiency and experience in observability such as white and black box monitoring, SLO alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, Elasticsearch, etc.
- Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform, etc.)
- Experience with container and container orchestration (e.g., ECS, Kubernetes, Docker, etc.)
- Hands‑on experience with the design, deployment, and operation of OpenTelemetry collectors in production environments, focusing on technical aspects such as configuring, optimizing, and troubleshooting OTLP endpoints and receivers
- Ability to tackle reliability design and functionality problems independently with little to no oversight
- Practical cloud native experience
- Ability to expand and collaborate across different levels and stakeholder groups
- Knowledge of distributed tracing, metrics, and logging best practices
- Certification in AWS, Kubernetes, or relevant technologies
- Proven track record in system health monitoring, capacity management, and blameless postmortems for high‑availability services
- Deep understanding of distributed system…
Position Requirements
10+ years of work experience