Sr. Engineer - SRE - AI
Job in Detroit, Wayne County, Michigan, 48228, USA
Listed on 2026-06-27
Listing for: Ally
Full Time position
Job specializations:
- IT/Tech: SRE/Site Reliability, AI Engineer (Applied/Software), Cloud Computing: Infrastructure & Operations
Job Description & How to Apply Below
Career area: Technology
Work Location(s): 500 Woodward Avenue, MI; 601 S. Tryon Street, NC
Remote? No
Ref #: 22455
Posted Date: 06-24-26
Working time: Full time
### Ally and Your Career
Ally Financial only succeeds when its people do - and that’s more than some cliché people put on job postings. We live this stuff! We see our people as, well, people - with interests, families, friends, dreams, and causes that are all important to them. Our focus is on the health and safety of our teammates as well as work-life balance and diversity and inclusion.
From generous benefits to a variety of employee resource groups, we strive to build paths that encourage employees to stretch themselves professionally. We want to help you grow, develop, and learn new things. You’re constantly evolving, so shouldn’t your opportunities be, too?
Work Schedule:
Ally designates roles as (1) fully on-site, (2) hybrid, or (3) fully remote. Hybrid roles are generally expected to be in the office a certain number of days per week as indicated by your manager. Your hiring manager will discuss this role's specific work requirements with you during the hiring process. All work requirements are subject to change at any time based on leader discretion and/or business need.
### The Opportunity
At Ally, you get a startup feel, but experience the benefits of a company that has worked out the kinks and is fulfilling its purpose. We are always evolving and see that as a good thing. From owning our work to seeing its impact in the real world, our team is relentless in finding new ways technology can help make experiences better and help people.
We are problem solvers, we value diverse thinking, we support one another, and we challenge ourselves to think bigger in the journey to deliver customer-obsessed tech solutions. To read more about what our tech team does, be sure to visit our tech blog.
You will bring SRE practice to the AI agent ecosystem: defining and enforcing production readiness standards, building the observability and alerting layer, running the readiness gate for every service before it goes live, and owning the incident response process when things fail.
You will also build and operate the SRE agents themselves: the automated tools that run production readiness checks, generate post-incident reviews, monitor SLO burn rates, and surface reliability findings before a deployment proceeds.
This is not a traditional SRE role watching dashboards. You are an active builder. The SRE toolchain here is itself an agent-driven system — you will extend it, maintain its knowledge core content, and use it to enforce standards across every team in the program.
At this time, Ally will not sponsor a new applicant for employment authorization for this position.
### The Work Itself
Production readiness and SLO ownership
* Run the 10-point production readiness gate for every Lightspeed and Logos service before first production deploy — SLOs defined, runbook exists, alerting configured, rollback documented, on-call assigned
* Define and maintain Dynatrace SLOs for AI-powered services; configure burn-rate alerting (multi-window, aligned to user impact)
* Own the error budget policy: track consumption, flag services approaching exhaustion, enforce the deployment freeze when budgets are gone
Observability for AI workloads
* Instrument AI agent pipelines with structured JSON logging (traceId, spanId, correlationId), custom metrics, and distributed traces
* Build Dynatrace dashboards for AI services: request rate, error rate, latency P50/P95/P99, dependency health, agent invocation counts and failure rates
* Identify and address the observability gaps that make AI system failures hard to diagnose: context truncation, tool call failures, model timeouts, partial completions
SRE agent development and maintenance
* Own the sre-gate, sre-monitor, sre-pir, sre-remediation, and sre-validation agents — keep their behavioral rules, domain context, and integration patterns current
* Maintain the SLO definitions, domain team map, incident classification, and runbook location config in the knowledge core (domain.md, architecture.md)
*…
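For candidates less familiar with the multi-window burn-rate alerting and error budget tracking named under production readiness and SLO ownership, the idea can be sketched roughly as below. This is an illustrative Python sketch only, not Ally's tooling and not the Dynatrace API: the names `BurnWindow`, `burn_rate`, `should_alert`, and `budget_remaining` are hypothetical, and the 14.4 threshold is just a commonly cited example value for a 1-hour/5-minute window pair against a 30-day SLO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnWindow:
    """Hypothetical container for one long/short window pair."""
    long_error_rate: float   # observed error ratio over the long window
    short_error_rate: float  # observed error ratio over the short window
    threshold: float         # burn-rate multiple that triggers the alert


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Return how many times faster than allowed the error budget is burning."""
    budget = 1.0 - slo_target  # allowed error ratio, e.g. about 0.001 for 99.9%
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def should_alert(window: BurnWindow, slo_target: float) -> bool:
    """Alert only when both the long and the short window exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    """
    return (
        burn_rate(window.long_error_rate, slo_target) >= window.threshold
        and burn_rate(window.short_error_rate, slo_target) >= window.threshold
    )


def budget_remaining(total_requests: int, failed_requests: int,
                     slo_target: float) -> float:
    """Fraction of the period's error budget still unspent (negative if overspent)."""
    allowed_failures = total_requests * (1.0 - slo_target)
    if allowed_failures <= 0:
        raise ValueError("no error budget available for this period")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Assumed example: 99.9% SLO, 2% errors in both windows.
    window = BurnWindow(long_error_rate=0.02, short_error_rate=0.02, threshold=14.4)
    print("page on-call:", should_alert(window, slo_target=0.999))
    print("budget left:", budget_remaining(1_000_000, 500, slo_target=0.999))
```

A deployment freeze of the kind described in the error budget bullet could then key off `budget_remaining` reaching zero, though the actual policy thresholds would be set by the team.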