Observability Platform Engineer
Listed on 2026-06-14
IT/Tech
Systems Engineer, Cloud Computing: Infrastructure & Operations, IT Support, SRE/Site Reliability
Neuberger's Technology team is seeking an Observability Engineer to lead and evolve our observability strategy across cloud and on-premises environments. You will serve as the primary owner and subject matter expert for our Datadog platform: building, scaling, and operating a comprehensive monitoring solution that continuously validates service health (24/7) across business‑critical systems, including external websites and key infrastructure components (e.g., firewalls, OpenShift).
You will design and implement end‑to‑end observability solutions spanning logs, metrics, traces, Service Level Objectives (SLOs), synthetic monitoring, and Real User Monitoring (RUM) to improve reliability, accelerate incident response, and deliver clear visibility into service performance.
This is an individual contributor role with strong Datadog engineering and scripting expectations. It is not a pure administrator role, though prior admin experience is beneficial. You will partner closely with application, SRE/DevOps, infrastructure, and security teams and serve as the internal champion and evangelist for Datadog adoption, standards, and best practices. The environment includes an active migration from OpenView to Datadog, with workflows integrating into ServiceNow for incident routing and escalation.
What You’ll Do
- Serve as the primary Datadog platform owner — architecting, building, and maintaining scalable observability solutions across cloud and on‑prem environments (Windows and Linux/Unix), with direct ownership of monitoring capabilities for key applications and services.
- Partner closely with application, DevOps, SRE/operations, infrastructure, and security teams to translate reliability goals into actionable Datadog monitoring strategies, dashboards, SLOs, and alerting frameworks.
- Lead and execute the migration from OpenView to Datadog, ensuring continuity of coverage and an improvement in monitoring fidelity across all migrated services and infrastructure.
- Develop and automate processes using Datadog's APIs, Terraform provider, and scripting (Python, PowerShell, Bash) to manage monitors, dashboards, alerts, and telemetry configuration at scale, ensuring consistency across Windows Server and Unix (Linux/Solaris) environments.
- Implement and optimize Datadog APM, distributed tracing, log management, infrastructure monitoring, and Network Performance Monitoring (NPM) to provide full‑stack visibility.
- Build and evolve Datadog RUM and Synthetic Monitoring capabilities to track end‑user experience and proactively validate availability of external‑facing services and critical workflows.
- Define and operationalize SLOs and error budgets within Datadog; drive alert noise reduction through correlation, enrichment, threshold tuning, and monitor dependency mapping.
- Integrate Datadog with ServiceNow for incident/problem ticket routing and escalation; produce runbooks, post‑incident reviews, and executive/operational dashboards to support reliability reporting.
- Champion OpenTelemetry (OTel) adoption and drive consistent logging, metrics, and tracing standards across the engineering organization using Datadog as the central observability platform.
- Onboard new applications and services into Datadog; provide guidance and enablement to engineering teams on instrumentation, agent deployment, and observability best practices.
- Collaborate on platform cost optimization, data governance, and scaling strategies to ensure Datadog remains performant and cost‑effective as the environment grows.
- BS/BA in Computer Science, Information Systems, Engineering, or equivalent experience.
- 5+ years in Observability, APM, SRE, or Platform Engineering — with at least 2–3 years of hands‑on, production‑grade Datadog experience.
- Deep expertise across Datadog's core product suite: APM, Infrastructure Monitoring, Log Management, Synthetics, RUM, SLOs, Dashboards, Monitors, and Alerting.
- Proficiency in both Windows Server and Unix (Linux/Solaris) environments, including agent deployment, service instrumentation, and OS‑level performance analysis.
- Strong scripting and automation skills (Python, PowerShell, Bash) with hands‑on…