Senior Scalability Engineer - Observability Job, Denver area, Colorado, USA, IT/Tech

About Judi Health

Judi Health is an enterprise health technology company providing a comprehensive suite of solutions for employers and health plans, including:

Capital Rx, a public benefit corporation delivering full-service pharmacy benefit management (PBM) solutions to self-insured employers,
Judi Health™, which offers full-service health benefit management solutions to employers, TPAs, and health plans, and
Judi®, the industry's leading proprietary Enterprise Health Platform (EHP), which consolidates all claim administration-related workflows in one scalable, secure platform.

Together with our clients, we're rebuilding trust in healthcare in the U.S. and deploying the infrastructure we need for the care we deserve. To learn more, use the "Apply for this Job" box below.

Location:

Remote

Position Summary:

Join our Scalability team as a Senior Scalability Engineer focused on observability platform development and engineering productivity. In this role, you will define, own, and build Judi Health's organization-wide observability strategy, tooling, and platform products. Beyond maintaining infrastructure, you'll architect and develop a custom observability platform that gives engineering teams powerful, fast, and cost-effective visibility into every layer of our infrastructure, from application logs and metrics to distributed traces.

You'll build production-grade internal products using React/TypeScript frontends with Python and Rust backends, creating tools that fundamentally improve how engineers at Judi Health debug, monitor, and optimize their systems. Working closely with leadership and cross-functional teams, you will do work that is foundational to platform stability, performance optimization, and developer productivity across our rapidly growing healthcare platform.

Position Responsibilities:

In this role, you'll own the observability infrastructure that powers our engineering organization. You will:

Architect observability platform:
Design, implement, and maintain the LGTM stack (Loki, Grafana, Tempo, Mimir/Prometheus) as the primary observability platform across all engineering teams, making architectural decisions that balance cost, performance, and developer experience.
Build internal observability products:
Design and develop production-grade internal platform products with React/TypeScript frontends and Python/Rust backends that provide engineers with powerful log search, metrics visualization, and trace analysis capabilities.
Develop custom log indexing systems:
Architect and build high-performance log indexing solutions using Rust that process logs and provide sub-second search across billions of log lines at a fraction of the cost.
Integrate SQL analytics for logs:
Design and implement solutions leveraging AWS Athena or similar SQL query engines (DuckDB, ClickHouse) for ad-hoc log analysis and historical queries, enabling engineers to run complex SQL queries over S3-based log data for deep investigations and trend analysis.
Create advanced query interfaces:
Build sophisticated web interfaces that allow engineers to query logs, metrics, and traces with features like saved queries, query templates, correlation analysis, and pattern detection, supporting both full-text search and SQL-based analytics.
Balance cloud-native and open-source:
Architect solutions that thoughtfully leverage both AWS-managed services (CloudWatch, Athena, Kinesis) and open-source tooling (LGTM stack, Quickwit) to optimize for cost, performance, and operational flexibility based on use case requirements.
Integrate AWS observability:
Design seamless integration between AWS CloudWatch Logs/Metrics and our custom observability platform, providing unified visibility across managed and self-hosted infrastructure.
Build intelligent alerting:
Develop smart dashboards, monitors, and alerting systems that reduce noise, detect anomalies, and help teams respond to incidents quickly.
Partner with engineering teams:
Work directly with product teams to integrate observability into their services, establish logging and metrics standards, and instrument code effectively, serving as the observability subject matter expert.
Enable performance optimization:
Provide the observability foundation that allows the Scalability team to identify performance bottlenecks, track optimization impact, and measure platform stability with data-driven insights.
Establish observability standards:
Define and document comprehensive observability standards including structured logging patterns, metric naming conventions, trace instrumentation, dashboard design principles, and query best practices.
Drive platform adoption:
Lead workshops, create documentation, and build self-service tooling that democratizes observability across engineering, making it easy for teams to adopt best practices.
Demonstrate technical leadership:
Mentor engineers on observability practices, lead architecture reviews for instrumentation approaches, and represent the Scalability team in cross-functional planning.
Work in an…