Observability Engineer
Listed on 2026-05-11
IT/Tech
SRE/Site Reliability, Systems Engineer
THE POSITION
FanDuel is looking for a Staff Observability Engineer to design, build, and mature the observability ecosystem that underpins our platform and services. You will deliver deep visibility into system behavior by combining system telemetry with user signals to provide a holistic view of performance, reliability, and user experience. You’ll also explore how AI and machine learning can enhance observability, from intelligent alerting and anomaly detection to accelerating root cause analysis.
This is a hands‑on role. You’ll partner closely with engineering and product teams to deliver scalable observability capabilities, serve as a subject‑matter expert in monitoring, alerting, and incident management, and equip teams with self‑service insights and tooling. By connecting system behavior to real user impact and leveraging AI‑assisted workflows to surface issues faster, you’ll drive improvements in reliability, performance, and data‑informed decision‑making across the organization.
THE GAME PLAN
Everyone on our team has a part to play.
- Contribute to defining and driving the observability strategy and roadmap across multiple teams, aligning with business priorities and engineering goals.
- Design and improve scalable observability capabilities that provide actionable insights into system health, performance, and user experience.
- Establish and standardize best practices for monitoring, alerting, incident management, and post‑mortems across the organization.
- Drive operational excellence by evolving incident management, on‑call practices, and post‑incident learning, ensuring systemic improvements over local fixes.
- Lead cross‑team initiatives to improve end‑to‑end reliability, identifying systemic risks and driving their resolution.
- Leverage automation and AI‑assisted workflows to accelerate root cause analysis and reduce operational toil at scale.
- Partner with engineering and product leadership to translate observability insights into strategic roadmap decisions.
- Identify trends across system and user signals to proactively detect, prevent, and mitigate large‑scale issues.
- Optimize observability platforms for cost, scalability, and long‑term sustainability.
- Mentor engineers and raise the reliability and observability maturity across the organization.
- In addition to the responsibilities outlined above, employees may be required to perform other duties as assigned by the Company to ensure operational flexibility and meet evolving business needs.
AWS, Kubernetes, Terraform, Helm, Ansible, Vault, Datadog, and PagerDuty.
THE STATS
- Significant hands‑on experience in observability engineering, SRE, platform engineering, or related roles, with a track record of driving impact beyond individual teams.
- Strong expertise in monitoring and observability, with significant hands‑on experience in Datadog.
- Experience defining and driving observability or reliability strategy across teams or domains.
- Proficiency with Kubernetes, cloud infrastructure (AWS), and infrastructure‑as‑code tools (Terraform).
- Proven ability to influence technical direction and decision‑making across multiple teams and stakeholders.
- Deep understanding of distributed systems principles (e.g. consistency, availability, partition tolerance) and their real‑world trade‑offs.
- Experience defining and implementing SLOs, SLIs, and alerting strategies, including user‑centric and business‑aligned metrics.
- Strong software engineering fundamentals, with proficiency in at least one modern programming language (Go, Java, Python, or TypeScript), and the ability to design scalable systems, build tooling and automation, and operate effectively within large, complex codebases.
- Experience driving large‑scale improvements through automation, reducing organizational toil, and eliminating classes of recurring issues.
- Strong analytical skills, with the ability to translate technical signals into business and customer impact.
- Excellent communication…