Cloud Engineer - Senior; Observability - Datadog
Listed on 2026-06-21
IT/Tech
SRE/Site Reliability, Cloud Computing: Infrastructure & Operations, Systems Engineer
Job Description
The Cloud Engineer - Senior (Observability - Datadog) supports the SEC ISS contract by engineering, operating, and continuously improving the enterprise observability platform across hybrid cloud and containerized environments. This role is hands‑on: the engineer instruments services with distributed tracing, code-level profiling, and custom metrics; builds and tunes Datadog (or comparable) dashboards, alerts, APM, log pipelines, RUM, and synthetic monitors; and then uses that telemetry to solve production performance, reliability, and capacity problems.
The engineer partners with cloud, platform, and application teams to embed observability into Azure, AWS, and container platforms (OpenShift/Kubernetes), and drives reductions in alert noise, mean time to detect (MTTD), and mean time to resolve (MTTR). This position provides senior technical leadership for APM/distributed tracing strategy, SLO/SLI engineering, and data‑driven operational decision‑making in a 24x7x365 operating environment.
STRONG DATADOG EXPERIENCE NEEDED
Observability Platform Engineering
- Engineer and operate the enterprise observability stack (Datadog or comparable), including metrics, logs, traces, APM, RUM, synthetic monitoring, and network performance monitoring.
- Build, tune, and maintain dashboards, monitors, SLOs/SLIs, and alerting policies that produce actionable signal and minimize noise.
- Instrument services, infrastructure, and containerized workloads using agents, OpenTelemetry, and language‑specific APM tracers (Java, .NET, Python, Node.js, Go) with consistent span tagging, W3C Trace Context propagation, and unified service tagging across the estate.
- Develop and maintain integrations between observability platforms, ITSM (ServiceNow), CI/CD pipelines, and on‑call/paging workflows.
- Define and enforce a unified tagging standard (environment, service, version, team/ownership, data classification, cost center) across metrics, logs, and traces; manage tag cardinality, governance, and custom business tags to keep telemetry queryable, attributable, and cost‑controlled.
- Design and deliver monitoring coverage for Microsoft Azure and AWS workloads, including PaaS services, serverless, networking, identity, managed databases, and cloud‑native data services.
- Engineer managed database observability across AWS RDS/Aurora (MySQL, PostgreSQL, SQL Server, Oracle), Azure SQL/PostgreSQL/MySQL, and NoSQL/cache services (DynamoDB, Cosmos DB, ElastiCache/Redis), including query‑level performance analytics, slow‑query and execution‑plan capture, lock/deadlock/wait analysis, connection pool and session monitoring, replication lag, storage/IOPS saturation, and backup/HA health, correlating database spans with upstream APM traces.
- Engineer container‑platform observability for OpenShift/Kubernetes, covering cluster health, control plane, nodes, pods, namespaces, ingress, service mesh, and workload APM.
- Build standardized, reusable monitoring modules deployable via infrastructure‑as‑code (Terraform, Bicep, ARM) and CI/CD.
- Support hybrid visibility across on‑premises, cloud, and containerized workloads with correlated telemetry.
- Lead data‑driven investigation and resolution of complex performance, latency, saturation, and reliability issues across the estate.
- Use APM distributed traces, service/dependency maps, continuous code profiling (CPU, memory, lock contention), database query analytics, exception/error tracking, and RUM‑to‑backend trace correlation to isolate bottlenecks in applications, platforms, middleware, and downstream dependencies.
- Partner with engineering teams to define and implement remediation, tuning, and architectural improvements based on telemetry evidence.
- Define and implement trace‑based SLOs, deployment tracking, and change‑correlation workflows so performance regressions are detected and attributed to specific releases, versions, or configuration changes.
- Provide senior technical leadership during major incidents, delivering impact analysis, contributing to root‑cause analysis, and owning post‑incident observability gaps.