
Grafana Observability SME

Job in Poughkeepsie, Dutchess County, New York, 12600, USA
Listing for: TechDigital Corporation
Full Time position
Listed on 2026-06-13
Job specializations:
  • IT/Tech
    IT Support, Systems Engineer, Cloud Computing, Cybersecurity
Job Description & How to Apply Below

Top Skills:
1. Production expertise across the full Grafana stack:
Mimir, Loki, Tempo, Alloy, Beyla, Grafana Application Observability, Unified Alerting.
2. Strong PromQL, LogQL, and TraceQL authoring skills; able to write recording rules and SLO queries from scratch.
3. OpenTelemetry practitioner: OTLP, collectors, SDK/agent instrumentation for at least three of Java, .NET, Go, Python, Node.js.
4. eBPF-based auto-instrumentation experience with Beyla (or equivalent — Pixie, Cilium Tetragon) in a production context.
5. Experience integrating Grafana alerts into ServiceNow Event Management (native inbound integration, not webhook-only patterns); familiarity with ServiceNow ITOM, AIOps event correlation, and CMDB CI attachment.
6. Multi-environment hosting fluency — on-prem, AWS, Azure — and Linux/Windows host agent deployment at scale.
7. Dashboard-as-code and GitOps patterns (Grafana provisioning, Terraform provider, or Grizzly).
8. Excellent written communication — solution architecture documents, runbooks, and stakeholder-facing status reporting.

Role Summary
Own the end-to-end technical design, build, and operationalization of the Grafana Cloud observability platform for a 50-application estate spanning Java, .NET, Go, Python, and Node.js workloads hosted across on-premises data centres, AWS, and Azure. The SME serves as the senior technical authority across all eight in-scope Grafana Cloud modules and is accountable for instrumentation strategy, alerting design, dashboarding standards, and integration into ServiceNow ITOM via native Event Management.

Scope is application-level observability only: server and network health remain on SolarWinds, and URL/synthetic monitoring remains on Uptrends.

Key Responsibilities

• Platform architecture and configuration across all eight in-scope Grafana Cloud modules:
Grafana 12 (visualization), Mimir (metrics, 13-month retention), Loki (logs), Tempo (distributed tracing via OTLP), Alloy (telemetry collection agent), Beyla (eBPF zero-code auto-instrumentation), Application Observability (OTel-native APM), and Unified Alerting.

• Tenancy and access design — organizations, folders, teams, role-based access control, dashboard variables, template links, and annotations.

• Application instrumentation strategy by technology stack:
Beyla eBPF as the default zero-code path for Simple and Medium apps;
OpenTelemetry SDKs/agents (Java, .NET, Go, Python, Node.js) for Complex apps requiring deeper traces and custom metrics; JMX Exporter and runtime-specific exporters where stack-appropriate.

• Log pipeline engineering via Alloy — structured JSON, Log4j/Logback, Serilog, NLog, Windows Event Log, Winston, Pino, loguru — with parsing rules tuned per stack and LogQL-based dashboards and alerts.

• Alerting design: PromQL/LogQL/TraceQL rules, severity taxonomy, grouping, routing, and notification policies. Build a low-noise, actionable alert feed; tune thresholds iteratively with application owners.

• Single Pane of Glass: design and deliver a tiered SPoG that surfaces Grafana application telemetry alongside contextual links to SolarWinds and Uptrends.

• Business Dashboards and Reporting — partner with the Dashboard Lead to define KPI taxonomy and ensure dashboard-as-code patterns and version control.

• ServiceNow ITOM integration: co-own the design and review of the Grafana → ServiceNow Event Management (native inbound integration) flow, covering event allow-list governance ("deny by default"), enrichment, deduplication, AIOps correlation, automated incident creation with severity mapping and assignment group rules, CMDB CI attachment, and ServiceNow-as-master incident state.

• Quality assurance authority across all technical deliverables: solution architecture document, instrumentation runbooks, dashboard and alert library, integration test results.

• Phased delivery execution — Mobilise & Client → Application Foundation (ML1) → Onboarding of 40 Simple apps (ML2) → Medium/Complex apps + ITOM Integration (ML2→3) → SPoG, Dashboards & Reporting (ML3→4) → Stabilisation, KT, and post-deployment support (ML4).

• Knowledge transfer — produce platform operating procedures and conduct…
