Lead Software Engineer, Observability - AdTech Leader
Listed on 2026-02-16
Software Development
Software Engineer
Overview
Lead Software Engineer, Observability and Reliability
This role is for an experienced engineer who believes reliability is built through software, not heroics. You will operate at the intersection of systems engineering, platform development, and observability, helping ensure that large-scale, customer-facing services remain fast, resilient, and predictable as they grow. Your work will directly shape how engineering teams design, ship, and operate critical systems.
The Opportunity
You will join a reliability-focused engineering group within a fast-growing software platform that supports data-driven, highly personalized digital experiences. The environment is complex, distributed, and performance-sensitive. Reliability is treated as a product feature, and the team approaches operations as an engineering discipline grounded in automation, measurement, and continuous improvement.
What This Team Does
The reliability and observability team builds core backend services, internal platforms, and automation that allow product engineering teams to release software safely and scale it with confidence. The group partners closely with feature teams, embedding where needed to improve architecture, performance, and operational maturity.
The team also acts as educators and advocates, helping engineers across the organization learn how to debug distributed systems, design self-healing services, and push system performance to its practical limits.
Your Impact
As a Lead Software Engineer in Observability and Reliability, you will define how complex production problems are solved and prevented. You will own key technical areas, set direction for reliability improvements, and influence how engineering teams think about availability, scalability, and efficiency.
Your work will improve customer-facing stability while also increasing the productivity of product engineers by reducing operational friction, noise, and uncertainty.
What You Will Do
You will design, build, and operate foundational services that enable highly available and scalable systems. You will identify systemic bottlenecks and lead efforts to remove them, achieving meaningful gains in throughput, latency, and resilience.
You will develop tooling, automation, and processes that prevent incidents before they happen, working with partners to address root causes rather than symptoms. You will define and own the technical roadmap for your domain, collaborating with stakeholders to prioritize the highest-impact work.
You will write and maintain production software that improves service availability, operational efficiency, and performance. You will work closely with product engineers and other reliability engineers to ship changes that matter.
You will participate in an on-call rotation with a strong emphasis on learning, prevention, and alert quality. When issues arise, you will help drive clear diagnosis, resolution, and long-term fixes.
You will use data and quantitative analysis to understand system behavior, guide scaling decisions, and measure improvement. You will actively promote reliability best practices through design reviews, documentation, and hands-on collaboration.
Technical Environment
The systems you work on run in cloud-based environments and rely on technologies such as Python, container orchestration platforms, infrastructure as code, relational and in-memory data stores, and Linux-based operating systems. Observability, automation, and safe deployment practices are core to how work gets done.
What You Bring
You bring a decade or more of experience in site reliability engineering, platform engineering, or DevOps-focused roles. You have spent significant time operating production systems and understand how software behaves under real-world conditions.
You are comfortable leading through incidents and can guide teams from failure through root cause analysis to durable prevention. You have a strong understanding of Linux systems and networking fundamentals, from the operating system up through application-level behavior.
You have experience building software as part of an engineering team and write high-quality code in languages such as Python,…