Senior Cloud Engineer, Observability

Job in St. Louis, Missouri, 63105, USA
Listing for: Bayer
Full Time position
Listed on 2026-06-26
Job specializations:
  • IT/Tech
    SRE/Site Reliability
Salary/Wage Range or Industry Benchmark: 130,000 - 180,000 USD yearly
Job Description & How to Apply Below
Location: St. Louis

We’re looking for an entrepreneurial builder who treats observability as a product: paved roads for telemetry, opinionated patterns for dashboards/alerts, and a relentless focus on improving signal quality and reducing time‑to‑detect/time‑to‑recover. You’ll partner with delivery teams, security, and data to standardize how we instrument services, monitor reliability, and learn from production.

YOUR TASKS AND RESPONSIBILITIES

The primary responsibilities of this role are:

Observability Enablement & Support (Primary Focus)
  • Be the hands‑on SME for our observability toolchain (e.g., Datadog, CloudWatch, OpenSearch), including log pipelines, tracing/telemetry standards, and platform templates.
  • Run office hours, produce exemplars, and pair with teams to implement known‑good instrumentation and alerting.
  • Triage and resolve observability‑related platform requests (new service onboarding, log/metric gaps, noisy alerts, dashboard standards) with clear ownership and measurable outcomes.
  • Establish and operationalize SLIs/SLOs for key platform components and enable teams to define service SLOs without reinventing the wheel.
Own Observability Paved Roads & Golden Paths
  • Maintain opinionated “golden paths” for logging (standard fields/tags, retention, routing, searchability).
  • Maintain opinionated “golden paths” for metrics (naming conventions, cardinality guardrails, standard RED/USE views).
  • Maintain opinionated “golden paths” for tracing (service maps, critical spans, propagation standards).
  • Maintain opinionated “golden paths” for dashboards (starter dashboards by service type + curated views for platform reliability).
  • Provide reusable templates for alerting patterns (latency, error‑rate, saturation, dependency failures), tuned for actionable paging vs. noise.
Reliability Outcomes (Through Signals, Not Heroics)
  • Reduce MTTR by improving detection, triage paths, runbooks, and “what changed” visibility.
  • Drive reliability reviews focused on observability gaps: missing signals, unclear ownership, bad alerts, and uninstrumented failure modes.
  • Partner with delivery teams to turn recurring incidents into durable fixes (instrumentation + alerting + automation + documentation).
Observability + DevSecOps Integration
  • Embed observability checks into CI/CD and platform workflows (e.g., telemetry guardrails, dashboard/monitor templates, logging standards checks).
  • Partner with security/compliance to ensure telemetry supports auditability and incident investigation without ad-hoc effort.
Measure, Learn, Iterate (Ownership Mindset)
  • Define and report platform observability KPIs: alert noise rate, % actionable alerts, MTTA/MTTR trends, onboarding time to fully observable, runbook coverage, incident recurrence.
  • Run lightweight experiments to improve signal quality (threshold tuning, monitor redesign, dashboard UX), and ship improvements like a product owner.
Cost Stewardship for Telemetry (FinOps‑Aware Observability)
  • Create cost‑aware telemetry standards (log volume controls, metric cardinality guidance, sampling strategies, retention tiers).
  • Help teams optimize spend while improving reliability outcomes (“cheaper + better” logging/metrics patterns).
Collaboration & Mentorship
  • Serve as a trusted partner to delivery units, security, and data—turning pain points into paved‑road improvements.
  • Mentor engineers and uplift organizational practices for incident response, reliability signals, and operational excellence.
WHO YOU ARE

Required:

  • Bachelor’s in computer science/engineering or equivalent experience.
  • 5+ years hands‑on AWS experience operating production workloads.
  • Deep practical experience with observability in production, including Datadog and/or CloudWatch (dashboards, monitors/alerts, log search, correlation).
  • Experience designing actionable alerts (noise reduction, ownership, runbook‑first alerts).
  • Experience defining and using SLIs/SLOs and reliability metrics to drive behavior.
  • Strong proficiency with Infrastructure as Code (Terraform; CloudFormation a plus).
  • Strong programming for automation/tooling (Python, Go, or similar).
  • Solid grasp of cloud architecture, networking, and security fundamentals.

Preferred:

  • Experience productizing…
Position Requirements
10+ Years work experience