Senior Site Reliability Engineer
Listed on 2026-02-06
IT/Tech
Systems Engineer, Cloud Computing, IT Support, SRE/Site Reliability
Enterprise Technology plays a critical part in shaping the future of mobility. If you’re looking for the chance to leverage advanced technology to redefine the transportation landscape, enhance customer experience and improve people’s lives, this is the opportunity for you. Join us and challenge your IT expertise and analytical skills to help create vehicles that are as smart as you are.
In this position...
As a Senior Site Reliability Engineer, you will be instrumental in ensuring the reliability, performance, and scalability of the critical Ford Service Reservation Platform and its associated applications. This role demands a deep focus on SRE and platform engineering principles, advanced observability, robust automation, and proactive incident management.
Based in Dearborn, MI, this is a hybrid position with a required four-day onsite presence each week.
Responsibilities
What you'll do...
SRE Leadership & Strategy:
- Lead the implementation and continuous evolution of Site Reliability Engineering (SRE) practices to ensure high availability, performance, and scalability for the Ford Service Reservation Platform and its applications.
- Define, implement, and rigorously maintain Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for key services, directly aligning reliability goals with critical business and customer outcomes.
- Generate regular SLO and error budget reports, collaborating closely with engineering teams to strategically prioritize reliability work, incident follow-ups, and targeted technical debt reduction efforts.
- Lead weekly status and reliability reviews, effectively communicating risks, performance trends, and improvement opportunities to key stakeholders in engineering and product.
- Champion data-driven decision-making, leveraging observability insights to significantly improve incident response, reduce Mean Time to Resolution (MTTR), and enhance the overall customer experience.
Observability & Monitoring:
- Own, evolve, and optimize comprehensive observability solutions, primarily utilizing Dynatrace for full-stack visibility, Real User Monitoring (RUM), synthetic monitoring, and infrastructure monitoring across critical user journeys of the Ford Service Reservation Platform.
- Design and implement robust Google Cloud Platform (GCP) observability patterns for logs, metrics, alerts, and dashboards specifically tailored for the Ford Service Reservation Platform and its associated applications.
- Leverage Dynatrace and GCP log analytics insights to proactively drive incident reduction, facilitate efficient root cause analysis, and foster continuous performance improvements across all Ford Service Reservation services.
Automation & Infrastructure as Code (IaC):
- Develop and deploy infrastructure as code using Terraform to provision and manage GCP resources, including networking, load balancing, and monitoring artifacts.
- Configure and maintain essential DevSecOps tools such as SonarQube, FOSSA, Cycode, and 42Crunch to ensure code quality and security.
- Build reusable, scalable Terraform modules to automate the provisioning of GCP monitoring artifacts, including log-based metrics, alerting policies, uptime checks, and comprehensive dashboards.
- Develop and maintain robust CI/CD pipelines utilizing Tekton PAC and/or GitHub Actions for application code deployment, automated operational tasks (e.g., instance management, cache invalidation, and data backups), and infrastructure changes.
- Manage GitHub repositories for application code, automation scripts, and configuration management.
Incident & Problem Management:
- Establish and continually refine Incident Management and Problem Management processes, coordinating effectively with application teams for rapid resolution and thorough root cause analysis of issues.
- Identify systemic and application-specific issues through detailed analysis of observability data and collaborate proactively with development teams to prioritize feature requests and defect resolutions that enhance reliability.
GCP Expertise:
- Deep understanding of Google Cloud Platform services, specifically networking (VPC,…