
Data Reliability Engineer

Job in Tallahassee, Leon County, Florida, 32318, USA
Listing for: Maximus
Full Time position
Listed on 2026-06-18
Job specializations:
  • IT/Tech
    Cloud Computing: Infrastructure & Operations, SRE/Site Reliability, Data Engineering, Azure
Salary/Wage Range or Industry Benchmark: 80,000 - 100,000 USD Yearly
Job Description & How to Apply Below
Location: Tallahassee

Key Responsibilities

  • Own and improve the reliability, availability, observability, and operational supportability of Maximus UK's Azure Databricks platform, Azure data services, and associated data pipelines and data products.
  • Design and implement monitoring, alerting, health checks, and diagnostics across Azure Databricks, Azure data services, orchestration layers, storage, and downstream consumption, extending these patterns into AWS as the estate grows.
  • Define and maintain reliability standards, controls, operational runbooks, and support models that improve the resilience, predictability, and supportability of data services.
  • Work closely with data engineering teams to identify, prioritise, and remediate reliability, performance, and data quality issues across Databricks notebooks, jobs, workflows, and other Azure data workloads.
  • Establish proactive incident detection, triage, and root cause analysis practices, reducing mean time to detect and mean time to recover for data-related issues.
  • Design and implement robust data quality controls, validation frameworks, reconciliation processes, and anomaly detection approaches across the end-to-end data lifecycle.
  • Configure and use Azure Purview to provide effective data cataloguing, lineage, ownership, and governance, ensuring reliability and quality controls are visible and auditable.
  • Collaborate with platform, cloud, architecture, and security teams to ensure the data estate is secure, resilient, cost-effective, and aligned to enterprise standards and patterns.
  • Contribute to the reliability engineering approach for an Azure-first data platform while supporting reusable patterns and operational readiness for data services in AWS.
  • Partner with architects and engineers so that new pipelines, data products, and platform services are designed with operability, recoverability, scalability, and observability built in from the start.
  • Automate repetitive operational tasks, environment checks, dependency verification, failure handling, and recovery processes to increase efficiency and reduce manual intervention and risk.
  • Capture lessons learned, codify reliability patterns and standards, and share best practice to continuously improve reliability, transparency, and engineering discipline across the data function.

Essential Skills - What You'll Bring

  • Proven experience in data engineering, platform engineering, site reliability engineering, DataOps, or a closely related role focused on data platform reliability and operations.
  • Strong hands‑on experience with Azure-based data platforms, particularly Azure Databricks and core Azure data services such as Data Lake Storage, Data Factory/Synapse, and analytical stores, plus familiarity with equivalent services in AWS.
  • Strong understanding of modern data platform architectures, including data lakes, warehouses or lakehouses, orchestration frameworks, transformation pipelines, streaming services, and analytical consumption layers.
  • Experience designing and implementing monitoring, observability, logging, alerting, and incident management approaches for Databricks workloads and Azure data services, using tools such as Azure Monitor, Log Analytics, or similar.
  • Strong understanding of data quality, reconciliation, validation, and lineage concepts, and practical experience implementing control frameworks that protect critical data flows and products.
  • Hands‑on experience with Azure Purview or comparable data catalogue and lineage tooling, including configuration of collections, classification, ownership, and lineage for key datasets.
  • Good understanding of reliability engineering principles such as availability targets, resilience patterns, recoverability, service health indicators, and operational readiness assessments.
  • Experience using scripting and automation (for example, Python, PowerShell, or similar) to remove operational toil, improve repeatability, and strengthen recovery and deployment processes.
  • Ability to diagnose and resolve complex issues that span data pipelines, integrations, cloud infrastructure, configuration, and source or downstream systems, and to drive pragmatic remediation.
  • Strong collaboration and communication…