Remote Senior SRE: Scale, Reliability & Automation Leader
Whitefish, Flathead County, Montana, 59937, USA
Listed on 2026-06-04
IT/Tech
SRE/Site Reliability
Company Overview
Docusign brings agreements to life. Over 1.5 million customers and more than a billion people in over 180 countries use Docusign solutions to accelerate the process of doing business and simplify people’s lives. With intelligent agreement management, Docusign unleashes business-critical data that is trapped inside of documents. Until now, these were disconnected from business systems of record, costing businesses time, money, and opportunity.
Using Docusign’s Intelligent Agreement Management platform, companies can create, commit, and manage agreements with solutions created by the #1 company in e-signature and contract lifecycle management (CLM).
What you'll do
We are looking for a self-motivated, driven, and creative Senior Site Reliability Engineer to join the Site Reliability team. Metrics and analytics drive engineering at Docusign and ensure that we are dedicating valuable engineering cycles to the right places. This role is a unique opportunity to impact the entire Docusign team and drive adoption.
We are looking for a Senior Site Reliability Engineer (Senior SRE) to lead reliability initiatives for high‑impact services. In this role, you will own the reliability, scalability, and performance of one or more critical systems; lead the design and implementation of automation to eliminate toil and reduce operational risk; drive improvements in observability, incident response, and production readiness across teams; and partner closely with product engineering, platform, security, and release management to ship changes safely and quickly.
Senior SREs at Docusign operate as hands‑on technical leaders: they set the reliability bar for their domain, mentor other engineers, and lead cross‑functional projects that materially improve availability and customer experience. Ideally, you have a background in software development, incident management, service catalogs, request tracing systems, time series telemetry platforms, application performance management tools or log management tools. The role requires an on-call rotation every 4 weeks.
This position is an individual contributor role reporting to the Senior Manager, SRE.
Responsibilities
- Design, implement, and operate highly available, scalable services in cloud environments (primarily Azure, with some multi‑cloud scenarios)
- Define and evolve SLOs/SLIs, error budgets, and capacity strategies for owned services; use them to guide engineering trade‑offs and release decisions
- Analyze patterns in incidents and outages; own long‑term reliability improvements for your domain and contribute to reliability strategy across services
- Write high quality code that is easy to maintain and test
- Ensure design and architecture are extensible across projects, and participate in technical design and code reviews
- Identify operational toil and lead automation efforts to eliminate it—deployment, runbook, and remediation workflows that make incidents rarer and faster to resolve
- Develop robust, well‑tested tooling and shared libraries that are adopted across multiple teams
- Improve CI/CD pipelines and guardrails to reduce change failure rate while increasing deployment velocity
- Design and implement logging, metrics, tracing, and alerting for complex distributed systems; ensure signals are actionable and aligned to business impact
- Build and automate tools and solutions for incident impact analysis and effective mitigation
- Participate in and often lead incident response for Sev0–Sev2 events: triage, mitigation,…