Lead Software Engineer, Observability - AdTech Leader

Job in Boston, Suffolk County, Massachusetts, 02298, USA
Listing for: Andiamo
Full Time position
Listed on 2026-02-16
Job specializations:
  • Software Development
    Software Engineer
Salary/Wage Range or Industry Benchmark: 200000 - 250000 USD Yearly
Job Description & How to Apply Below

Overview

Lead Software Engineer, Observability and Reliability

This role is for an experienced engineer who believes reliability is built through software, not heroics. You will operate at the intersection of systems engineering, platform development, and observability, helping ensure that large-scale, customer-facing services remain fast, resilient, and predictable as they grow. Your work will directly shape how engineering teams design, ship, and operate critical systems.

The Opportunity

You will join a reliability-focused engineering group within a fast-growing software platform that supports data-driven, highly personalized digital experiences. The environment is complex, distributed, and performance-sensitive. Reliability is treated as a product feature, and the team approaches operations as an engineering discipline grounded in automation, measurement, and continuous improvement.

What This Team Does

The reliability and observability team builds core backend services, internal platforms, and automation that allow product engineering teams to release software safely and scale it with confidence. The group partners closely with feature teams, embedding where needed to improve architecture, performance, and operational maturity.

The team also acts as educators and advocates, helping engineers across the organization learn how to debug distributed systems, design self-healing services, and push system performance to its practical limits.

Your Impact

As a Lead Software Engineer in Observability and Reliability, you will define how complex production problems are solved and prevented. You will own key technical areas, set direction for reliability improvements, and influence how engineering teams think about availability, scalability, and efficiency.

Your work will improve customer-facing stability while also increasing the productivity of product engineers by reducing operational friction, noise, and uncertainty.

What You Will Do

You will design, build, and operate foundational services that enable highly available and scalable systems. You will identify systemic bottlenecks and lead efforts to remove them, achieving meaningful gains in throughput, latency, and resilience.

You will develop tooling, automation, and processes that prevent incidents before they happen, working with partners to address root causes rather than symptoms. You will define and own the technical roadmap for your domain, collaborating with stakeholders to prioritize the highest-impact work.

You will write and maintain production software that improves service availability, operational efficiency, and performance. You will work closely with product engineers and other reliability engineers to ship changes that matter.

You will participate in an on-call rotation with a strong emphasis on learning, prevention, and alert quality. When issues arise, you will help drive clear diagnosis, resolution, and long-term fixes.

You will use data and quantitative analysis to understand system behavior, guide scaling decisions, and measure improvement. You will actively promote reliability best practices through design reviews, documentation, and hands-on collaboration.

Technical Environment

The systems you work on run in cloud-based environments and rely on technologies such as Python, container orchestration platforms, infrastructure as code, relational and in-memory data stores, and Linux-based operating systems. Observability, automation, and safe deployment practices are core to how work gets done.

What You Bring

You bring a decade or more of experience in site reliability engineering, platform engineering, or DevOps-focused roles. You have spent significant time operating production systems and understand how software behaves under real-world conditions.

You are comfortable leading through incidents and can guide teams from failure through root cause analysis to durable prevention. You have a strong understanding of Linux systems and networking fundamentals, from the operating system up through application-level behavior.

You have experience building software as part of an engineering team and write high-quality code in languages such as Python,…
