Senior Site Reliability Engineer
Listed on 2026-06-21
IT/Tech
SRE/Site Reliability, Cloud Computing: Infrastructure & Operations, AWS
About the Role
Project Graph is a new creative system that lets you combine first‑ and third‑party AI models, Adobe tools, and custom interactive components inside a visual, designer‑friendly node graph editor. Connect a library of tactile, interactive components that bring creative control of AI back to your fingertips. Transform ideas into powerful workflows and modular tools that run anywhere, from the web to Adobe apps like Photoshop.
And if you’d rather focus on creating than building, you can tap into a growing ecosystem of community‑built tools without ever opening the editor.
The services team focuses on developing HTTP APIs for two primary uses: (1) managing Graph plugins, including their versioning, publishing, access control, and search; and (2) providing an async compute platform for running Graphs in the cloud. In this reliability‑focused role, you will own the availability, performance, and operability of these services, working closely with cluster orchestrators like Kubernetes and building on top of AWS cloud infrastructure.
You’ll partner with the backend engineers building these APIs to make sure the system stays fast, resilient, and cost‑effective as it scales.
- Define and enforce SLOs, SLIs, and error budgets for Project Graph’s HTTP APIs and async compute platform.
- Build and maintain observability—metrics, logging, tracing, and alerting—so issues are caught and diagnosed quickly.
- Lead incident response, run blameless post‑mortems, and drive the follow‑up work that prevents recurrence.
- Improve the reliability and scalability of an async job scheduling system built on top of Kubernetes and Postgres.
- Maintain and improve CI/CD systems to keep delivery fast, safe, and reliable.
- Own database data protection, backup, and resilience—including backup strategy, recovery testing, and disaster recovery planning.
- Design and implement cloud infrastructure and automation to meet reliability, performance, and cost goals.
- Reduce operational toil through tooling and automation, and partner with developers to build reliability in from the start.
- Participate in an on‑call rotation.
- Bachelor’s degree or equivalent experience in Computer Science.
- 5‑10 years of experience in site reliability engineering, infrastructure, or backend software development with a strong operational focus.
- Expertise with Kubernetes in production, including scaling, troubleshooting, and tuning.
- Expertise with Docker and containerization.
- Strong experience with Bash and CI/CD tools such as CircleCI.
- Strong hands‑on experience in at least one server‑side language; we use Node.js/TypeScript.
- Experience operating data stores such as Postgres, Redis, or similar in production; we run on AWS Aurora (Postgres‑compatible), so familiarity with managed/Aurora environments is a plus.
- Experience with database backup, resilience, and disaster recovery—designing backup strategies, testing recovery, and meeting RPO/RTO targets.
- Experience with Terraform and AWS.
- Hands‑on experience with observability tooling (metrics, logging, distributed tracing) and alerting.
- Familiarity with HTTP API security.
- A track record of incident response and a systematic, blameless approach to learning from failures.
- An interest in and ability to learn new technologies.
- Ability to tackle problems in a sustainable way, always striving to improve our processes and learn.
- Excellent verbal and written communication skills; can effectively articulate complex ideas and influence others through well‑reasoned explanations.
Our compensation reflects the cost of labor across several U.S. geographic markets, and we pay differently based on those defined markets. The U.S. pay range for this position is $159,200 – $301,600 annually. Pay within this range varies by work location and may also depend on job‑related knowledge, skills, and experience. Your recruiter can share more about the specific salary range for the job location during the hiring process.
In California, the pay range for this position is $208,300 – $301,600. In Washington, the pay range for this position is $190,200 – $275,400. In addition, certain roles may be eligible for long‑term incentives in…