Site Reliability Engineer Job, Reston area, Virginia, USA, IT/Tech

Position: Staff Site Reliability Engineer

Team Overview

At TransUnion, this role will report to a DevOps Director. The Site Reliability Engineering team drives reliability strategy, elevates engineering standards, and owns some of the most complex and consequential work on the platform.

This is a hybrid position and involves regular performance of job responsibilities virtually as well as in-person at an assigned TU office location for a minimum of two days a week.

Role Overview and Core Responsibilities

As a Staff Site Reliability Engineer at TransUnion, you will serve as a senior technical leader and force multiplier on the SRE team. Operating with full autonomy, you will drive reliability strategy, lead high-risk technical initiatives, and set the engineering standards that elevate the entire team. You'll bring deep expertise across GCP, Kubernetes, CI/CD pipelines, and monitoring platforms - contributing to strategic decisions on major platform components while fully participating in the on-call rotation.

Whether stepping in to lead the team, owning complex capacity and security work, or anchoring incident response with calm and maturity, your impact will be felt across the platform and the people around you.

Technical Leadership & Strategic Influence

* Recognized expert across multiple systems; actively contributes to architectural and strategic decisions around major platform components.

* Leads research, testing, implementation, and continuous improvement for new systems and tooling.

* Performs complex, high-impact work including capacity planning, load testing, and security improvements.

Operational Excellence & On-Call

* Fully participates in the team's on-call rotation; models calm, effective, and blameless incident response.

* Serves as a significant technical contributor during major incidents and problem resolution.

* Plans and leads high-risk maintenance events with minimal to no customer impact.

Standards & Team Elevation

* Elevates team standards through new tooling, processes, procedures, and effective communication.

* Capable of stepping in to lead and represent the team - a trusted resource during transitions or coverage gaps.

* Sets new professional benchmarks in technical quality, engineering culture, and cross-functional collaboration.

Required Knowledge and Experiences

* 5+ years of experience in Cloud Architecture, Site Reliability Engineering, Platform Engineering, or related fields - with a proven track record of designing and delivering at enterprise scale.

* Architectural authority - you don't just contribute to technical decisions, you drive them. You've owned the design of large-scale, mission-critical systems from whiteboard to production.

* Deep, hands-on expertise with Google Cloud Platform (GCP) and Kubernetes (K8s) - running high-volume, high-availability workloads with 99.999% reliability targets.

* Mastery of CI/CD pipeline architecture - designing end-to-end delivery systems that are fast, safe, and built for scale.

* Expert-level command of monitoring, observability, and alerting platforms (e.g., Datadog, Prometheus, Grafana, PagerDuty) - you define what good looks like.

* Deep Linux expertise - from kernel internals and system performance tuning to hardening and troubleshooting at the OS level in production environments.

* Strong command of database architecture - including relational (PostgreSQL, MySQL, Cloud SQL) and NoSQL (Bigtable, Firestore, Redis) systems, with experience designing for high availability, replication, failover, and performance at scale.

* Advanced networking knowledge - including VPCs, subnets, DNS, load balancing, firewall rules, VPNs, Private Service Connect, and hybrid connectivity patterns across cloud and on-prem environments.

* Proven expertise in Infrastructure-as-Code (IaC) - designing and enforcing scalable, reusable frameworks using Terraform, Pulumi, or equivalent tools.

* Strong proficiency in scripting and automation (e.g., Python, Bash, Go) - building the tools and workflows that eliminate toil and accelerate delivery.

* Deep understanding of security architecture - including identity and…