Databricks Site Reliability Engineer
Listed on 2026-06-05
IT/Tech
Cloud Computing, Data Engineer
It is exciting to work for a company that makes the world measurably better. We are committed to bringing safety, quality, and customer focus to the advanced ceramics manufacturing business.
Software Site Reliability Engineer
As the Site Reliability Engineer, you will support CoorsTek's Databricks application and data product strategy by ensuring solutions built, migrated, and deployed on Databricks are reliable, secure, observable, supportable, and cost‑effective in production. This role is not solely focused on monitoring and operational support. You will actively develop automation, platform tooling, deployment pipelines, observability capabilities, and reliability solutions that reduce operational toil and improve the scalability of Databricks-hosted applications and data products.
Working alongside Data & Analytics, you will partner with Architecture, Cybersecurity, Infrastructure, Manufacturing IT/OT, Enterprise Applications, citizen developers, and business teams to support production reliability for Databricks-hosted applications (Pattern B), analytics products, workflows, and AI-enabled solutions.
Roles And Responsibilities
- Support production reliability, operational readiness, and lifecycle support for Databricks-hosted applications, data products, dashboards, notebooks, jobs, workflows, APIs, and AI-enabled solutions.
- Support applications migrated to Databricks, built directly in Databricks, or promoted from citizen development and IT development into governed production patterns.
- Execute intake, review, handoff, support, and release practices for Pattern B Databricks applications, including minimum requirements before production deployment.
- Partner with citizen developers, IT developers, data engineers, enterprise architects, and business stakeholders to convert prototypes into reliable, monitored, documented, and supportable services.
- Implement and maintain observability standards, including logging, alerting, health checks, SLIs/SLOs, lineage, usage monitoring, cost monitoring, and operational dashboards.
- Respond to incidents, coordinate troubleshooting, participate in root cause analysis, and support corrective actions for failed jobs, broken pipelines, access issues, performance issues, data refresh failures, and application outages.
- Maintain and update runbooks, support procedures, escalation paths, ownership models, service catalogs, and knowledge articles for Databricks applications and data products.
- Partner with Data & Analytics on Databricks workflows, Delta Lake, Unity Catalog, data lineage, permissions, SQL warehouses, jobs, clusters, serverless capabilities, and performance tuning.
- Partner with Cybersecurity and Architecture to ensure Databricks solutions meet standards for identity, access, secrets management, logging, data classification, responsible AI, and least‑privilege access.
- Support CI/CD, testing, environment promotion, release controls, rollback procedures, and change management for Databricks applications and related Azure or integration components.
- Identify recurring failure patterns and assist with automating manual support work, reducing operational toil, and creating reusable templates and standards.
- Advise teams on production‑ready design, including resiliency, scalability, maintainability, cost control, data quality checks, monitoring hooks, and clear ownership.
- Collaborate with manufacturing, finance, supply chain, quality, and other business teams to understand impact, prioritize recovery, and maintain trust in critical Databricks-supported solutions.
- Support governance for citizen-built solutions by ensuring business-created applications have appropriate documentation, testing evidence, security review, support model, and IT transition plan before broad use.
- Monitor and troubleshoot service health, and track support metrics, incidents, problem records, platform risks, and improvement backlog items for Databricks applications and data products.
- Design and develop automation, self‑healing workflows, monitoring integrations, and operational tooling using Python and cloud‑native technologies.
- Bachelor's degree in Computer Science, Information Technology,…