Software Engineer, Agentic Platform; Seattle, WA
Listed on 2026-06-02
Software Development
Cloud Engineer - Software, DevOps, Software Engineer
Docker has been one of the most loved brands in developer tooling, trusted by more than 20 million monthly users and over 20 billion container image pulls.
We are a globally distributed, remote-first team building the tools that define how software gets built and delivered. As AI agents redefine software development, Docker is at the center of that shift, providing the sandboxed environments, verified images, and secure infrastructure that make autonomous workflows trustworthy by default.
Join Docker's Agentic Platform team to build the foundational infrastructure powering the next generation of AI-driven workflows. Intelligent agents are rapidly becoming the primary interface between developers and complex systems, and we're building the platform that makes them reliable, scalable, and observable at production scale.
You'll be working on the core agent execution runtime, orchestration primitives, and the cloud infrastructure that keeps the Agentic Platform running 24/7. This is a high-ownership role: you won't just build systems, you'll run them, respond when they fail, and drive continuous improvement across the stack.
This is a greenfield opportunity to shape how agents are built and operated. You'll work alongside seasoned engineers, collaborating with partner teams across AI infrastructure, developer experience, and platform reliability.
Please note: for this role, we are prioritizing candidates who currently live in the Seattle, WA metro area.
Agent Workflow & Orchestration
Design and operate the core agent execution runtime responsible for scheduling, state management, and lifecycle management of long-running agentic workflows
Build robust multi-agent coordination patterns: task handoff, agent memory (short-term and long-term), tool use, and workflow branching at scale
Develop context window management strategies and session persistence layers for stateful agent interactions
Build tooling for prompt engineering as a first-class engineering discipline — versioning, testing, and evaluation of prompts at scale
Build platform capabilities that support developers working in AI-assisted coding workflows, including IDE integrations, local-first development environments, and fast iteration loops
Own and operate Agentic Platform services in AWS or OCI: infrastructure provisioning, scaling, cost management, and reliability
Provision and manage cloud infrastructure using Terraform; manage Kubernetes application packaging and deployment with Helm
This role may require participation in a 24/7 on-call rotation for the Agentic Platform; you will carry genuine pager responsibility for the services you build and operate
Define and uphold SLOs; lead incident response and blameless post-mortems, and drive continuous reliability improvements
Instrument systems for observability: distributed tracing, structured logging, metrics dashboards, and alerting
As a Staff Engineer, partner with engineering leadership to set technical direction and serve as a guide and mentor as the team grows
Drive architectural decisions that balance velocity with long-term maintainability across a distributed, cloud-native stack
Collaborate cross-functionally with product managers, designers, and partner engineering teams to integrate agentic capabilities into the broader developer platform
Contribute to a culture of engineering excellence through design reviews, RFC processes, and mentorship
Required:
8+ years of professional, hands-on, full-time software engineering experience in backend, infrastructure, or platform engineering.
Cloud Platform Expertise (AWS/OCI/Azure/GCP): Proven, hands-on experience operating production services in AWS or Oracle Cloud Infrastructure: compute, networking, managed services, IAM, and cost management. This is a must-have; the Agentic Platform is a cloud-native service running 24/7.
Service Ownership in a Cloud Setting: You have owned production services end-to-end — on-call, incident response, SLO definition, and post-mortems. You don't just build; you run what you build.
Distri…