Senior Site Reliability Engineer
Listed on 2026-06-22
IT/Tech
SRE/Site Reliability, Systems Engineer, Cloud Computing: Infrastructure & Operations
Your Opportunity
At Schwab, you’re empowered to make an impact on your career. Here, innovative thought meets creative problem solving, helping us challenge the status quo and transform the finance industry together.
We believe in the importance of in-office collaboration and fully intend for the selected candidate for this role to work on site in the specified location(s).
As a Senior Site Reliability Engineer within the CETSAvE organization, you will play a critical leadership role in advancing the reliability, scalability, and performance of Schwab’s mobile and digital platforms. You will lead efforts to elevate production operations through modern Site Reliability Engineering practices, shaping how engineering teams design, build, and operate resilient systems at scale.
In this role, you will drive measurable improvements in service health and client experience by defining and executing strategies that enhance observability, automation, and system resilience. You will partner cross-functionally with engineering, architecture, infrastructure, and product teams to embed reliability, scalability, and operational excellence into the full software development lifecycle.
Success in this role requires strong problem-solving and decision-making skills, particularly in complex, high-scale distributed environments. You will influence technical direction, introduce best practices such as service level objectives and error budgets, and guide teams in reducing operational toil through automation and tooling innovation. Your leadership will ensure teams are aligned on reliability goals, respond effectively to production challenges, and continuously improve systems through learning and adaptation.
You will also play a key role in evolving operational maturity by strengthening on‑call practices, enabling faster detection and resolution of issues, and fostering a culture of accountability, collaboration, and continuous improvement. This is an opportunity to shape enterprise‑wide engineering standards while developing high‑performing teams and advancing modern reliability engineering capabilities.
Key Responsibilities
Production Operations & Incident Management
- Respond to system alerts and production incident escalations
- Lead or support incident triage, resolution, and root cause analysis
- Drive and contribute to post-incident reviews and continuous improvement actions
- Participate in an on-call rotation to support high-availability systems
- Ensure comprehensive monitoring coverage and effective alerting strategies across systems
- Continuously improve visibility into system performance, reliability, and health
- Define and evolve observability best practices, including telemetry, dashboards, and alert thresholds
- Design and build automation solutions to reduce operational toil and improve resiliency
- Develop scripts and tooling using Python and shell scripting for system maintenance and performance optimization
- Contribute to CI/CD and deployment pipeline improvements
- Automate processes such as service recovery, system maintenance, and certificate management
- Partner with development teams to understand system changes and ensure production readiness
- Establish guardrails for monitoring, alerting, and escalation procedures
- Embed reliability practices into the software development lifecycle
- Proactively identify system weaknesses, risks, and performance gaps
- Drive improvements in system reliability, scalability, and resilience
- Implement and evolve SRE best practices (SLOs, error budgets, incident reduction strategies)
- Explore the use of AI and automation to improve incident detection, triage, and response
- Identify opportunities to enhance response times and reduce manual intervention
- Mentor and support junior engineers in SRE best practices and automation techniques
- Influence engineering teams to adopt proactive reliability and observability practices
- Promote a culture of curiosity, ownership, and continuous improvement