Observability Engineer (IT/Tech), Cleveland area, Ohio, USA

Duration: 6 Months (Potential Extension)

Location: Cleveland, OH area – Hybrid (4 days onsite / 1 day remote)

About the Role

We are seeking an experienced Observability Engineer to support and expand a centralized enterprise observability platform. This initiative is focused on building a true “single pane of glass” monitoring environment using modern telemetry and monitoring technologies including Prometheus, Grafana, and Loki.

The current environment captures approximately 50% of server telemetry and is now evolving to include cross-domain observability across infrastructure, applications, databases, storage, and business transaction data. Long-term goals include enabling AI/ML-driven anomaly detection and intelligent root-cause analysis.

This is an opportunity to play a key role in building an enterprise-wide operational intelligence platform.

Responsibilities

Expand telemetry ingestion across infrastructure, databases, storage platforms, applications, and network environments
Assist with onboarding remaining systems and extending monitoring beyond traditional OS metrics
Build and enhance Grafana dashboards that correlate infrastructure health with application performance and business transaction metrics
Develop and maintain synthetic monitoring scripts using Playwright or similar tools to simulate critical user journeys
Configure and optimize alerting workflows using Alertmanager and Loki
Improve signal-to-noise ratio and reduce alert fatigue through better event management practices
Establish and maintain telemetry labeling standards and data quality practices
Support troubleshooting, root-cause analysis, and operational documentation efforts
Partner with engineering and infrastructure teams to drive observability best practices across the enterprise

Required Qualifications

Hands-on experience with:
- Grafana
- Loki
- Alertmanager
Strong experience writing PromQL queries and building Grafana dashboards
Experience designing or supporting enterprise observability and monitoring platforms
Ability to collect and normalize telemetry across:
- Servers
- Databases
- Networks
- Applications
Experience with synthetic monitoring tools such as Playwright or Selenium
Experience editing and managing YAML and JSON configuration files
Knowledge of alert routing, escalation workflows, and reducing alert fatigue
Understanding of telemetry standards, labeling strategy, and data hygiene practices
Strong troubleshooting and analytical skills

Preferred Qualifications

Oracle and SQL database experience
Experience with SNMP, network flow data, or infrastructure performance monitoring
Exposure to AI/ML-based observability or anomaly detection initiatives

This role offers the opportunity to help shape the future of enterprise monitoring and observability while working on high-impact initiatives supporting large-scale infrastructure and application environments.
