Site Reliability Engineer (SRE)
Listed on 2026-01-01
-
IT/Tech
Cloud Computing, Systems Engineer
Clearance
Minimum Active Clearance
Job Details
The AWS Site Reliability Engineer (SRE) is responsible for the operational health, availability, and performance of the AWS and Databricks environments built by the Platform Engineering team. You will prepare for and take ownership of "day two" operations, focusing on observability, incident response, and capacity planning. You will design and implement comprehensive monitoring solutions using tools like Amazon CloudWatch to track the health of Databricks clusters, job performance, and underlying AWS resources.
Your goal is to minimize downtime and inefficiencies (manual, repetitive work) by automating operational tasks and recovery procedures. You will define and track Service Level Objectives (SLOs) to balance reliability with innovation, and you will create the operational Standard Operating Procedures (SOPs).
- Amazon CloudWatch, performance tuning in cloud environments, Infrastructure as Code (IaC) tools, Databricks management and performance instrumentation