Site Reliability Engineer, Datadog Specialist Job, Denver area, Colorado, USA, IT/Tech

About the Role

Grade Level (for internal use): 09

Site Reliability Engineer – Datadog Specialist

The Team: The IT Operations team at S&P Dow Jones Indices owns and operates the Production systems that power S&P DJI’s global index platforms. Our focus is reliability, visibility, and operational excellence, ensuring critical market-facing services remain available, observable, and resilient.

Responsibilities and Impact:

This role sits at the intersection of Site Reliability Engineering and Observability, focused on the hands-on implementation and operation of enterprise telemetry platforms. The position supports application, infrastructure, and production support teams by ensuring systems are well-instrumented, observable, and diagnosable in Production environments.

We are seeking a hands-on Observability Engineer with strong experience using Datadog and modern telemetry tools. This is not a general DevOps or platform engineering role; it is a tool-focused position responsible for implementing, operating, and continuously improving observability across applications, databases, and infrastructure within an established SRE framework.

Own and evolve end-to-end observability using Datadog:

APM, Distributed Tracing, DBM
Log ingestion, parsing, pipelines, and correlation
Synthetic monitoring, RUM (where applicable)
AI-driven alerting, Watchdog, and anomaly detection

Design and enforce monitoring standards:

Alert quality, signal-to-noise reduction
Golden signals, SLO/SLA-aligned monitoring
Consistent tagging, naming, and telemetry hygiene

Serve as the primary Datadog platform specialist:

Dashboards, monitors, service catalog, integrations
Cost visibility and optimization of logs/APM/DBM usage
Enablement and onboarding of application teams

Support production incident response:

Use Datadog, Splunk, and logs to triage incidents
Lead or support root-cause analysis and post-incident reviews
Improve observability gaps identified during incidents
Integrate telemetry with other ITSM tools such as ServiceNow and PagerDuty to support incident and change workflows

Partner with engineering teams to:

Improve instrumentation (APM, custom metrics, logs)
Adopt OpenTelemetry where appropriate
Validate observability during releases and changes
Participate in DR testing, operational readiness reviews, and continuous improvement of SRE/IT Ops practices

Compensation/Benefits Information: (This section is only applicable to US candidates)

S&P Global states that the anticipated base salary range for this position is $90,000 to $122,000. Final base salary for this role will be based on the individual’s geographic location, as well as experience level, skill set, training, licenses and certifications.

In addition to base compensation, this role is eligible for an annual incentive plan. This role is not eligible for additional compensation such as an annual incentive bonus or sales commission plan.

This role is eligible to receive additional S&P Global benefits. For more information on the benefits we provide to our employees, please .

What We’re Looking For:

Basic Required Qualifications:

4+ years of experience in Observability, SRE, or Production Operations roles
Strong, hands-on Datadog experience: APM, logs, DBM, dashboards, monitors, integrations
Experience working with telemetry concepts: metrics, logs, traces, log correlation, distributed tracing
Working knowledge of AWS environments (EC2, ECS, RDS, S3, DynamoDB, etc.)
Ability to read and reason about application code (Java and/or Python) to support instrumentation, troubleshooting, and telemetry design (this is not a feature-development role)
Experience integrating monitoring tools with PagerDuty and ServiceNow
Strong troubleshooting, documentation, and communication skills

Additional Preferred Qualifications:

Datadog certifications (APM, Logs, Fundamentals)
Exposure to Splunk, ELK, Dynatrace, or similar tools
Experience with OpenTelemetry (instrumentation or collectors)
Familiarity with CI/CD pipelines and containerized workloads
Experience supporting mission-critical, high-availability systems
Financial services, index, or data-platform experience

Location: This role can be hybrid 2-3 days a week at most of…

