Asset & Wealth Management - Site Reliability Engineer - Vice President - Richardson, Texas, USA

Site Reliability Engineer - Vice President

Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run scalable, massively distributed, fault‑tolerant systems. At Goldman Sachs, SRE is responsible for improving the availability and reliability of the firm’s most critical platform services and for ensuring they meet the requirements of our internal and external users. It is also responsible for firmwide policies and standards focused on the firm’s digital resilience.

We are looking for engineers who are motivated to collaborate with our businesses to build and run sustainable production systems that can evolve and adapt to changes in our fast‑paced, global business environment.

Role Overview

As a Site Reliability Engineer (SRE) at Goldman Sachs, you will be a pivotal leader in ensuring the availability, reliability, and scalability of the firm's most critical platform applications and services. You will combine deep software and systems engineering expertise to architect, build, and run large‑scale, massively distributed, fault‑tolerant systems. This role involves providing technical leadership, mentoring senior engineers, and collaborating closely with internal teams and executive stakeholders to build and operate sustainable production systems that can adapt to our dynamic global business environment.

You will drive a culture of continuous improvement, championing the adoption of advanced SRE principles and best practices across the organization.

Responsibilities

Strategic Reliability & Performance:
Drive the strategic direction for availability, scalability, and performance of mission‑critical applications and platform services, ensuring alignment with firmwide objectives.
Architectural Leadership:
Lead the design, build, and implementation of highly available, resilient, and scalable infrastructure and application architectures.
Advanced Automation & Tooling:
Architect and develop sophisticated platforms, tools, and automation solutions to eliminate toil, optimize operational workflows, and enhance deployment processes across the enterprise.
Complex Incident Management & Post‑Mortem Analysis:
Lead critical incident response, conduct in‑depth root cause analysis for systemic issues, and implement long‑term preventative measures to significantly enhance system stability and resilience.
System Design & Capacity Planning:
Partner with development teams to embed reliability into application design from inception, provide expert system design consulting, and lead comprehensive capacity‑planning initiatives for future growth.
Observability & Insights:
Define and implement advanced monitoring, high‑volume logging with multi‑user query capabilities, and tracing strategies to provide deep, actionable insights into application performance, infrastructure health, and user experience.
Technical Vision & Mentorship:
Provide technical vision, lead complex technical projects, conduct rigorous code reviews, enforce SDLC best practices, and actively mentor and develop senior and staff‑level engineers.
Technology Evaluation & Adoption:
Stay at the forefront of industry trends and advancements, evaluating and integrating cutting‑edge tools and frameworks to significantly improve operational efficiency and reliability.
On‑Call Leadership:
Participate in and lead on‑call rotations, providing expert guidance and hands‑on support for critical system incidents.

Qualifications

Experience:

Minimum of 6 years of hands‑on experience in Site Reliability Engineering, with a proven track record in architecting, designing, building, and maintaining highly available, scalable, and fault‑tolerant systems at an enterprise level.
Technical Proficiency:
- Exceptional programming skills in one or more major languages such as Java, Python, or Go, with a focus on building robust, scalable software.
- Extensive hands‑on experience with cloud platforms (e.g., AWS, GCP) and deep expertise in containerization and orchestration technologies (e.g., Docker, Kubernetes).
- Mastery of Infrastructure as Code (IaC) tools (e.g., Terraform, CloudFormation) and configuration management tools (e.g., Puppet,…