Site Reliability Engineer, Datadog Specialist
Listed on 2026-02-28
IT/Tech
IT Support, Systems Engineer, Cybersecurity, Cloud Computing
About the Role
Grade Level (for internal use): 09
Site Reliability Engineer – Datadog Specialist
The Team: The IT Operations team at S&P Dow Jones Indices owns and operates the Production systems that power S&P DJI’s global index platforms. Our focus is reliability, visibility, and operational excellence, ensuring critical market-facing services remain available, observable, and resilient.
Responsibilities and Impact:
This role sits at the intersection of Site Reliability Engineering and Observability, focused on the hands-on implementation and operation of enterprise telemetry platforms. The position supports application, infrastructure, and production support teams by ensuring systems are well-instrumented, observable, and diagnosable in Production environments.
We are seeking a hands-on Observability Engineer with strong experience using Datadog and modern telemetry tools. This is not a general DevOps or platform engineering role; it is a tool-focused position responsible for implementing, operating, and continuously improving observability across applications, databases, and infrastructure within an established SRE framework.
Own and evolve end-to-end observability using Datadog:
- APM, Distributed Tracing, DBM
- Log ingestion, parsing, pipelines, and correlation
- Synthetic monitoring, RUM (where applicable)
- AI-driven alerting, Watchdog, and anomaly detection
Design and enforce monitoring standards:
- Alert quality, signal-to-noise reduction
- Golden signals, SLO/SLA-aligned monitoring
- Consistent tagging, naming, and telemetry hygiene
Serve as the primary Datadog platform specialist:
- Dashboards, monitors, service catalog, integrations
- Cost visibility and optimization of logs/APM/DBM usage
- Enablement and onboarding of application teams
Support production incident response:
- Use Datadog, Splunk, and logs to triage incidents
- Lead or support root-cause analysis and post-incident reviews
- Improve observability gaps identified during incidents
- Integrate telemetry with ITSM tools such as ServiceNow and PagerDuty to support incident and change workflows
Partner with engineering teams to:
- Improve instrumentation (APM, custom metrics, logs)
- Adopt OpenTelemetry where appropriate
- Validate observability during releases and changes
- Participate in DR testing, operational readiness reviews, and continuous improvement of SRE/IT Ops practices
Compensation/Benefits Information: (This section is only applicable to US candidates)
S&P Global states that the anticipated base salary range for this position is $90,000 to $122,000. Final base salary for this role will be based on the individual’s geographic location, as well as experience level, skill set, training, licenses and certifications.
In addition to base compensation, this role is eligible for an annual incentive plan. This role is not eligible for additional compensation such as an annual incentive bonus or sales commission plan.
This role is eligible to receive additional S&P Global benefits. For more information on the benefits we provide to our employees, please .
What We’re Looking For:
Basic Required Qualifications:
- 4+ years of experience in Observability, SRE, or Production Operations roles
- Strong, hands-on Datadog experience: APM, logs, DBM, dashboards, monitors, integrations
- Experience working with telemetry concepts: metrics, logs, traces, log correlation, distributed tracing
- Working knowledge of AWS environments (EC2, ECS, RDS, S3, DynamoDB, etc.)
- Ability to read and reason about application code (Java and/or Python) to support instrumentation, troubleshooting, and telemetry design (this is not a feature-development role)
- Experience integrating monitoring tools with PagerDuty and ServiceNow
- Strong troubleshooting, documentation, and communication skills
Additional Preferred Qualifications:
- Datadog certifications (APM, Logs, Fundamentals)
- Exposure to Splunk, ELK, Dynatrace, or similar tools
- Experience with OpenTelemetry (instrumentation or collectors)
- Familiarity with CI/CD pipelines and containerized workloads
- Experience supporting mission-critical, high-availability systems
- Financial services, index, or data-platform experience
Location: This role can be hybrid 2-3 days a week at most of…