Platform Engineer Job, San Francisco area, California, USA, Software Development

Location: San Francisco, CA (on-site)

Compensation: $200,000 to $250,000 base salary, plus bonus and equity

Overview

Shields Group Search is partnering with a fast-growing, Series A AI infrastructure company building the connective layer between AI agents and the tools people use every day, including GitHub, Gmail, Notion, Salesforce, and more.

The company is building core infrastructure that allows agents to safely and reliably communicate with external tools, execute workflows, manage authentication, run code, trigger actions, and operate across real‑world software environments.

They recently raised a $25M Series A from top‑tier investors and have seen rapid revenue growth, with customers ranging from early AI‑native startups to major technology companies.

This role is for a hands‑on Site Reliability Engineer / Platform Engineer who can help scale, harden, and own the company’s infrastructure as usage grows. The team is looking for someone with real production experience managing cloud infrastructure, reliability, observability, deployment systems, and high‑availability backend services.

This is an individual contributor role. Management experience is not required.

The ideal candidate has hands‑on experience across SRE, DevOps, backend engineering, infrastructure engineering, cloud platforms, distributed systems, and performance optimization. They should be comfortable owning infrastructure in a fast‑moving startup environment and should be able to show that they build, experiment, and go deep outside of assigned work.

What You’ll Do

Own reliability, scalability, observability, and performance across core production infrastructure
Manage and improve infrastructure across cloud platforms such as AWS, Vercel, and related systems
Build and improve the platform infrastructure supporting AI agent workflows, tool execution, authentication, triggers, APIs, sandboxes, and runtime orchestration
Design and operate reliable backend systems that interact with many third‑party tools and APIs
Improve infrastructure supporting high‑throughput, distributed, cloud‑native services
Work across cloud infrastructure, Linux systems, containers, deployment pipelines, service orchestration, CI/CD, and observability tooling
Build automation that reduces operational burden and improves incident response
Develop internal productivity tooling, runbooks, monitoring, alerting, dashboards, and reliability workflows
Debug complex production issues across application, infrastructure, network, database, deployment, and runtime layers
Improve system performance through tracing, CPU and heap profiling, database query optimization, workflow optimization, and deep root‑cause analysis
Help manage and improve multiple execution environments, including serverless runtimes, sandboxed code execution, and related backend systems
Partner closely with product engineers and customers to support important workloads and improve the platform in the process
Write clear documentation that explains complex systems, operational patterns, and infrastructure decisions
Help define the reliability culture, infrastructure standards, and technical bar for a small, high‑craft engineering team

What They’re Looking For

4+ years of software engineering, site reliability engineering, infrastructure engineering, DevOps, platform engineering, or distributed systems experience preferred, but not a hard requirement for exceptional candidates
Hands‑on experience managing production infrastructure across cloud environments
Experience with AWS, Vercel, Kubernetes, Linux, containers, deployment systems, observability tools, or similar infrastructure
Strong backend engineering fundamentals and ability to write production‑quality code
Experience with monitoring, tracing, logging, alerting, incident response, and system performance
Experience scaling and operating distributed systems, microservices, APIs, databases, queues, or high‑throughput backend services
Ability to debug hard production issues across many layers of the stack
Strong systems thinking and ability to understand how infrastructure, application code, databases, deployments, and customer‑facing workflows interact
Ability…