Sr Site Reliability Engineer Job, Austin area, Texas, USA, IT/Tech

Sr Site Reliability Engineer
Join us on our mission to empower more people to find their way home by breaking barriers to entry, making the right connections, and building confidence through expert guidance.
We are seeking a Senior Site Reliability Engineer to join our newly formed Operations Excellence organization, reporting to the Director, Operations Excellence. This role will contribute to the reliability, observability, and operational excellence of our platform infrastructure serving millions of users. As a Senior SRE, you will be a strong technical contributor who implements best practices, solves complex problems, and enables our 600+ engineers to deliver exceptional customer experiences.

You will work on critical platform systems including EKS infrastructure, Skyway (CI/CD), Frontdoor (Tyk API Gateway), Pantheon (Apollo GraphQL Federation), and our observability stack, while contributing to chaos engineering practices and cost optimization initiatives with measurable ROI.
What You'll Do:
Platform Reliability & Infrastructure
Implement and maintain highly available AWS infrastructure including EKS clusters, Fargate (ECS), and multi-region architectures
Support reliability of critical services: Skyway (CI/CD), Frontdoor (Tyk), Pantheon (Apollo GraphQL), and supporting infrastructure
Monitor SLIs, SLOs, and error budgets for Tier 1/2/3 systems; participate in architectural reviews for reliability and cost-efficiency
Implement reliability patterns including circuit breakers, graceful degradation, and automated failover
Observability & Cost Optimization
Implement observability solutions using New Relic for APM, distributed tracing, metrics, and logging for rapid troubleshooting
Build dashboards and alerts that reduce MTTD and MTTR; contribute to observability standards across teams
Identify infrastructure cost optimization opportunities and implement FinOps practices including rightsizing and resource lifecycle management
Support cost-conscious architecture decisions and CI/CD spend optimization (CircleCI, Argo CD)
Chaos Engineering & Incident Response
Execute chaos engineering experiments to identify system weaknesses; contribute to frameworks for safe production testing
Participate in game day exercises and disaster recovery simulations; create runbooks and automation for resilience
Participate in on-call rotation for critical systems; conduct post-incident reviews and implement improvements
Support incident response processes and contribute to the System Health Scorecard
Technical Contribution
Contribute as a strong technical individual contributor to the Operations Excellence team
Collaborate with Platform Engineering, Quality Engineering, and product teams on reliability initiatives
Support security initiatives including AWS Secrets Manager migration and compliance requirements (SOC 2, PCI, GDPR)
Contribute to Developer Experience metrics and platform adoption goals
May provide technical guidance to junior team members
What You'll Bring:
5+ years in Site Reliability Engineering, DevOps, or Infrastructure Engineering with demonstrated success improving system reliability
Bachelor's degree or equivalent experience
3+ years of hands-on experience with AWS (EKS, EC2, RDS, S3, CloudWatch, IAM) and Kubernetes, including cluster management
Proficiency in programming (Python, Go, or Java), with infrastructure automation and Infrastructure as Code experience (Terraform, CloudFormation)
Production experience with observability tools (New Relic, Datadog, Prometheus, Grafana, Splunk) and distributed systems
Experience with CI/CD platforms and GitOps workflows (CircleCI, Argo CD, Jenkins); on-call rotation and incident response
Preferred:
Exposure to chaos engineering tools, API Gateway technologies (Tyk/Kong), GraphQL federation (Apollo), cost optimization initiatives, FinOps principles
Technical Skills
Cloud & Infrastructure: AWS (EKS, Fargate, Lambda, VPC, Route 53, CloudFront), Kubernetes, Docker, Istio Service Mesh
CI/CD & GitOps: Argo CD, CircleCI, Jenkins, GitHub Actions
Observability: New Relic APM, distributed tracing, metrics & logging; Splunk logging
IaC & Automation: Terraform, CloudFormation, Helm, Kustomize, Python/Go/Bash
Platform Services: Tyk Gateway, Apollo GraphQL, AWS Secrets Manager, Vault
Incident Management: Opsgenie, PagerDuty, ServiceNow
Professional Qualities
Strong communication skills with ability to explain technical concepts to diverse audiences
Collaborative approach working across engineering, product, and business teams
Self-motivated with ability to solve complex problems within established practices and policies
Data-driven decision making with customer-centric approach and empathy for developer experience
How We Work:
We balance creativity and innovation on a foundation of in-person collaboration. For most roles, our employees work three or more days in our offices, where they have the opportunity to collaborate in person, adding richness to our culture and knitting us closer together.
How We Reward You:
is…