Observability Engineer Job, Atlanta area, Georgia, USA, IT/Tech

Position: Staff Observability Engineer
THE POSITION
Our roster has an opening with your name on it

FanDuel is looking for a Staff Observability Engineer to design, build, and mature the observability ecosystem that underpins our platform and services. You will deliver deep visibility into system behavior by combining system telemetry with user signals to provide a holistic view of performance, reliability, and user experience. You'll also explore how AI and machine learning can enhance observability, from intelligent alerting and anomaly detection to accelerating root cause analysis.

This is a hands-on role. You'll partner closely with engineering and product teams to deliver scalable observability capabilities, serve as a subject matter expert in monitoring, alerting, and incident management, and equip teams with self-service insights and tooling. By connecting system behavior to real user impact and leveraging AI-assisted workflows to surface issues faster, you'll drive improvements in reliability, performance, and data-informed decision-making across the organization.

THE GAME PLAN
Everyone on our team has a part to play

Contribute to defining and driving the observability strategy and roadmap across multiple teams, aligning with business priorities and engineering goals.
Design and improve scalable observability capabilities that provide actionable insights into system health, performance, and user experience.
Establish and standardize best practices for monitoring, alerting, incident management, and postmortems across the organization.
Drive operational excellence by evolving incident management, on-call practices, and post-incident learning, ensuring systemic improvements over local fixes.
Lead cross-team initiatives to improve end-to-end reliability, identifying systemic risks and driving their resolution.
Leverage automation and AI-assisted workflows to accelerate root cause analysis and reduce operational toil at scale.
Partner with engineering and product leadership to translate observability insights into strategic roadmap decisions.
Identify trends across system and user signals to proactively detect, prevent, and mitigate large-scale issues.
Optimize observability platforms for cost, scalability, and long-term sustainability.
Mentor engineers and raise the reliability and observability maturity across the organization.
In addition to the responsibilities outlined above, employees may be required to perform other duties as assigned by the Company to ensure operational flexibility and meet evolving business needs.

A Sneak Peek Into Our Tech Stack

AWS, Kubernetes, Terraform, Helm, Ansible, Vault, Datadog, and PagerDuty

THE STATS
What we're looking for in our next teammate

Significant hands-on experience in observability engineering, SRE, platform engineering, or related roles, with a track record of driving impact beyond individual teams.
Strong expertise in monitoring and observability, with significant hands-on experience in Datadog.
Experience defining and driving observability or reliability strategy across teams or domains.
Proficiency with Kubernetes, cloud infrastructure (AWS), and infrastructure-as-code tools (Terraform).
Proven ability to influence technical direction and decision-making across multiple teams and stakeholders.
Deep understanding of distributed systems principles (e.g. consistency, availability, partition tolerance) and their real-world trade-offs.
Experience defining and implementing SLOs, SLIs, and alerting strategies, including user-centric and business-aligned metrics.
Strong software engineering fundamentals, with proficiency in at least one modern programming language (e.g. Go, Java, Python, or TypeScript), and the ability to design scalable systems, build tooling and automation, and operate effectively within large, complex codebases.
Experience driving large-scale improvements through automation, reducing organizational toil, and eliminating classes of recurring issues.
Strong analytical skills, with the ability to translate technical signals into business and customer impact.
Excellent communication and stakeholder management skills, with the ability to influence both technical and non-technical audiences.
A mindset of ownership, with a focus on long-term impact, scalability, and continuous improvement.

Don't check all the boxes? That's okay! We encourage you to still apply if you feel like you possess an adjacent skill set and are interested in learning more about this position.

ABOUT FANDUEL

FanDuel Group is the premier mobile gaming company in the United States and Canada. FanDuel Group consists of a portfolio of leading brands across mobile wagering including:
America's #1 Sportsbook, FanDuel Sportsbook; its leading iGaming platform, FanDuel Casino; the industry's unquestioned leader in horse…