Site Reliability Engineer
Listed on 2026-05-30
IT/Tech
Systems Engineer, Cloud Computing, SRE/Site Reliability
Team Overview
At TransUnion, this role will report to a DevOps Director. The Site Reliability Engineering team drives reliability strategy, elevates engineering standards, and owns some of the most complex and consequential work on the platform.
As a Staff Site Reliability Engineer at TransUnion, you will serve as a senior technical leader and force multiplier on the SRE team. Operating with full autonomy, you will drive reliability strategy, lead high‑risk technical initiatives, and set the engineering standards that elevate the entire team. You’ll bring deep expertise across GCP, Kubernetes, CI/CD pipelines, and monitoring platforms, contributing to strategic decisions on major platform components while fully participating in the on‑call rotation.
Whether stepping in to lead the team, owning complex capacity and security work, or anchoring incident response with calm and maturity, your impact will be felt across the platform and the people around you.
This is a hybrid position and involves regular performance of job responsibilities virtually as well as in‑person at an assigned TU office location for a minimum of two days a week.
Role Overview and Core Responsibilities
Technical Leadership & Strategic Influence
- Recognized expert across multiple systems; actively contributes to architectural and strategic decisions around major platform components.
- Leads research, testing, implementation, and continuous improvement for new systems and tooling.
- Performs complex, high‑impact work including capacity planning, load testing, and security improvements.
- Fully participates in the team’s on‑call rotation; models calm, effective, and blameless incident response.
- Serves as a significant technical contributor during major incidents and problem resolution.
- Plans and leads high‑risk maintenance events with minimal to no customer impact.
- Elevates team standards through new tooling, processes, procedures, and effective communication.
- Capable of stepping in to lead and represent the team — a trusted resource during transitions or coverage gaps.
- Sets new professional benchmarks in technical quality, engineering culture, and cross‑functional collaboration.
- 5+ years of experience in Cloud Architecture, Site Reliability Engineering, Platform Engineering, or related fields — with a proven track record of designing and delivering at enterprise scale.
- Deep, hands‑on expertise with Google Cloud Platform (GCP) and Kubernetes (K8s) — running high‑volume, high‑availability workloads with 99.999% reliability targets.
- Mastery of CI/CD pipeline architecture — designing end‑to‑end delivery systems that are fast, safe, and built for scale.
- Expert‑level command of monitoring, observability, and alerting platforms (e.g., Datadog, Prometheus, Grafana, PagerDuty); you define what good looks like.
- Deep Linux expertise — from kernel internals and system performance tuning to hardening and troubleshooting at the OS level in production environments.
- Strong command of database architecture, including relational (PostgreSQL, MySQL, Cloud SQL) and NoSQL (Bigtable, Firestore, Redis) systems, with experience designing for high availability, replication, failover, and performance at scale.
- Advanced networking knowledge, including VPCs, subnets, DNS, load balancing, firewall rules, VPNs, Private Service Connect, and hybrid connectivity patterns across cloud and on‑prem environments.
- Proven expertise in Infrastructure‑as‑Code (IaC) — designing and enforcing scalable, reusable frameworks using Terraform, Pulumi, or equivalent tools.
- Strong proficiency in scripting and automation (e.g., Python, Bash, Go) — building the tools and workflows that eliminate toil and accelerate delivery.
- Hands‑on experience designing and integrating AI/ML‑powered solutions into cloud‑native platforms — including familiarity with LLM orchestration, vector databases, model serving infrastructure, and AI observability — with the ability to evaluate emerging tools and translate them into reliable, production‑grade capabilities.
At TransUnion, we…