Staff Engineer, Engineering Productivity & AI Quality

Harper is an AI-native commercial insurance company, based in San Francisco and built from scratch. Most knowledge work is judgment locked inside people's heads — the exceptions, the precedents, the decision traces no one ever wrote down. Converting that judgment into software is one of the largest human-to-computational transitions still in front of us, and we think the most honest place to prove it is the hardest one: commercial insurance, a trillion-dollar industry that is still, even now, more than 90% done by hand.

We're not patching legacy workflows or adding a copilot to them. We're rebuilding the business so that AI does the work and people do the judgment that AI can't yet — and then teaching it that, too.

It's working: ~1,000 new customers a month and roughly 100x growth in the past year. That pace sets the culture. We're on‑site in San Francisco, in the building together, working long days to high standards — because a rebuild this large doesn't happen part‑time or by committee. Almost no one joins Harper because they're passionate about insurance. They join because they want to be on the frontier of the AI transition, doing the most consequential work of their career, in a company being built to define a category rather than join one.

If that's the work you're looking for, insurance is just where you get to do it.

The role

Every great AI company ends up building the same invisible machine: the harnesses, tests, instructions, and review loops that let a small team ship with impossible leverage. At Harper that machine is existential. Our agents write code, serve customers, assemble submissions, and make decisions that move revenue — and AI‑generated code volume has pulled the scaling problem forward. Even with a 20‑person engineering team, our coding agents create the surface area, review burden, and architectural drift of a 100‑person org.

If the rails are strong, twenty engineers operate like a hundred; if they're weak, velocity turns into drag and the CTO becomes the rail — which doesn't scale. This is the founding seat for that machine. You'll turn the CTO's taste into systems — PR preflight, integration tests, architecture rules, agent instructions, eval gates, the feedback loops every engineer feels daily — across three sub‑disciplines:
Harness Engineering: the meta‑harness over our frontier coding agents, Open Claw, Hermes, and internal agents.
Developer Experience: CI/CD gates, build caching, merge queues, dev/staging/CI parity, the internal platform, eval infrastructure.
AI Quality: eval suite design, golden datasets, LLM‑as‑judge graders, production trajectory monitoring, drift detection, anti‑slop guardrails.

The mission is simple: make the right way the easy way, and make Harper's engineering org compound with every ship.

What you'll own

CI/CD quality gates across Harper's most critical services — the minimum bar before code can merge.
Integration test harnesses anchored to real failure modes — every repeated operational failure becomes a regression test, a validation, or an architecture rule.
The agent harness substrate — sandbox lifecycle, tool routing, prompt/context layer, model‑provider abstraction, multi‑agent coordination.
Repo‑level agent instructions and context hygiene — AGENTS.md per repo, canonical data‑model docs, banned patterns. The information environment our coding agents read.
Automated PR preflight — service‑impact summary, tests run, missing tests, model/migration changes, critical‑path warnings. The robot that reviews every PR before a human does.
Architecture‑rule enforcement — custom lints and structural tests that encode the CTO's taste mechanically. Once a rule is written down, it never gets argued in PR comments again.
Eval framework infrastructure — pre‑merge eval gating, experiments against curated datasets, production trajectory monitoring, all wired together.
Engineering metrics that matter — rework rate, escaped defects, flaky‑test count, deploy rollbacks, time‑to‑confident‑ship, AI‑generated PR quality. Anti‑vanity, anti‑LOC.

What we're looking for

8+ years building software, including…