Engineering - SRE Platforms - Site Reliability Engineer - Vice President

Job Description

Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run scalable, massively distributed, fault‑tolerant systems. At Goldman Sachs, the SRE team improves the availability and reliability of the firm’s most critical platform services and ensures they meet the requirements of internal and external users. The team develops and maintains platforms and tools, including central logging, monitoring, agents, alerting, capacity planning, operational readiness assessments, incident post‑mortems, SLIs/SLOs, and deployment automation.

Role Overview

As a Site Reliability Engineer at Goldman Sachs, you will provide technical leadership, mentoring, and cross‑functional collaboration to ensure the availability, reliability, and scalability of the firm’s critical platform applications and services.

Responsibilities

Strategic Reliability & Performance: Drive the strategic direction for availability, scalability, and performance of mission‑critical applications and platform services.
Architectural Leadership: Lead the design, build, and implementation of highly available, resilient, and scalable infrastructure and application architectures.
Advanced Automation & Tooling: Architect and develop sophisticated platforms, tools, and automation solutions to eliminate toil and enhance deployment processes.
Complex Incident Management & Post‑Mortem Analysis: Lead critical incident response, conduct root cause analysis, and implement preventative measures to improve system stability.
System Design & Capacity Planning: Partner with development teams to embed reliability into application design, provide expert system design consulting, and lead capacity planning initiatives.
Observability & Insights: Define and implement advanced monitoring, high‑volume logging, and tracing strategies that provide deep, actionable insights.
Technical Vision & Mentorship: Provide technical vision, lead projects, conduct code reviews, enforce best practices, and mentor senior and staff engineers.
Technology Evaluation & Adoption: Evaluate and integrate cutting‑edge tools and frameworks to improve operational efficiency.
On‑Call Leadership: Participate in and lead on‑call rotations, providing expert guidance during incidents.

Qualifications

Experience: 6+ years of hands‑on SRE experience with a proven track record of designing, building, and maintaining highly available, scalable, fault‑tolerant systems at enterprise scale.
Technical Proficiency:
- Exceptional programming skills in Java, Python, or Go.
- Extensive experience with cloud platforms (AWS, GCP), containerization, and Kubernetes.
- Mastery of IaC tools (Terraform, CloudFormation) and configuration management (Puppet, Chef, Ansible).
- Expertise in monitoring, alerting, logging, and tracing (Prometheus, Grafana, ELK, Datadog, PagerDuty).
- Deep knowledge of Linux internals, networking, distributed systems, and performance tuning.
- Experience with CI/CD tools (Jenkins, GitLab, Maven).
- Strong foundation in databases and distributed systems.
- Exceptional problem‑solving and analytical skills.
Preferred Experience:
- Distributed databases such as Elasticsearch.
- GCP BigQuery.
- Messaging systems like Kafka.
Education: Advanced degree in Computer Science or related field, or equivalent practical experience.
Soft Skills: Superior communication, collaboration, and leadership skills; ability to influence technical direction, manage stakeholders, and drive change.

Job Info

Job Identification: 153023
Job Category: Vice President
Posting Date: 04/13/2026, 02:28 PM
Location: Dallas, Texas, United States

