Observability Engineer
Listed on 2026-06-05
IT/Tech
Systems Engineer, IT Support
Duration: 6 Months (Potential Extension)
Location: Cleveland, OH area – Hybrid (4 days onsite / 1 day remote)
About the Role
We are seeking an experienced Observability Engineer to support and expand a centralized enterprise observability platform. This initiative is focused on building a true “single pane of glass” monitoring environment using modern telemetry and monitoring technologies including Prometheus, Grafana, and Loki.
The current environment captures approximately 50% of server telemetry and is now evolving to include cross-domain observability across infrastructure, applications, databases, storage, and business transaction data. Long-term goals include enabling AI/ML-driven anomaly detection and intelligent root-cause analysis.
This is an opportunity to play a key role in building an enterprise-wide operational intelligence platform.
Responsibilities
- Expand telemetry ingestion across infrastructure, databases, storage platforms, applications, and network environments
- Assist with onboarding remaining systems and extending monitoring beyond traditional OS metrics
- Build and enhance Grafana dashboards that correlate infrastructure health with application performance and business transaction metrics
- Develop and maintain synthetic monitoring scripts using Playwright or similar tools to simulate critical user journeys
- Configure and optimize alerting workflows using Alertmanager and Loki
- Improve signal-to-noise ratio and reduce alert fatigue through better event management practices
- Establish and maintain telemetry labeling standards and data quality practices
- Support troubleshooting, root-cause analysis, and operational documentation efforts
- Partner with engineering and infrastructure teams to drive observability best practices across the enterprise
Qualifications
- Hands-on experience with:
  - Grafana
  - Loki
  - Alertmanager
- Strong experience writing PromQL queries and building Grafana dashboards
- Experience designing or supporting enterprise observability and monitoring platforms
- Ability to collect and normalize telemetry across:
  - Servers
  - Databases
  - Networks
  - Applications
- Experience with synthetic monitoring tools such as Playwright or Selenium
- Experience editing and managing YAML and JSON configuration files
- Knowledge of alert routing, escalation workflows, and reducing alert fatigue
- Understanding of telemetry standards, labeling strategy, and data hygiene practices
- Strong troubleshooting and analytical skills
- Oracle and SQL database experience
- Experience with SNMP, network flow data, or infrastructure performance monitoring
- Exposure to AI/ML-based observability or anomaly detection initiatives
This role offers the opportunity to help shape the future of enterprise monitoring and observability while working on high-impact initiatives supporting large-scale infrastructure and application environments.