Senior Staff Engineer - Agent Platform Job Caerphilly area, Wales, UK, Software Development

🏅 The Role

We are looking for a Senior Staff Software Engineer to help shape and scale our Agentic layer. This role combines deep hands-on engineering with strategic technical ownership and organisational influence.

You will build the foundations that make AI agents production-ready at tem: the core runtime and tooling, the integration interfaces to our systems, and the engineering standards that let teams ship agentic capabilities safely, reliably, and repeatedly.

You will work end-to-end from early pilots to production roll-outs, partnering closely across engineering, product, data, and domain teams to translate real workflows into durable agent-powered systems. This role requires strong cross-team influence, the ability to align technical design with product and business outcomes, and the judgment to balance rapid delivery with long-term system integrity.

🚀 Responsibilities

Ship flagship agentic capabilities: Deliver high-impact agentic workflows end-to-end, from discovery through production roll-out, with clear success metrics and fast iteration loops.
Build and operate production-grade agent systems: Design reliable agentic systems that behave predictably under real-world constraints, including latency, cost, data quality, and failure modes, with strong patterns for state management, idempotency, and safe recovery.
Create shared foundations for agent delivery: Develop the core primitives that enable teams to build agents consistently (runtime patterns, tool interfaces, context management, shared libraries) while avoiding one-off implementations.
Establish a pragmatic Agent Development Life Cycle (ADLC): Implement evaluations, guardrails, tracing, monitoring, and release processes so agents can be measured, debugged, and improved continuously.
Integrate ML and LLM components into production workflows: Work with ML/Data teams to productionise models and LLM capabilities with clear contracts, versioning, observability, and safe degradation patterns.
Maintain clear domain boundaries as adoption scales: Define shared semantics for agent tools and data access, preventing domain drift while enabling teams to move quickly.
Collaborate with Platform on infrastructure and developer tooling: Adopt and extend existing CI/CD, DevEx, and observability systems, contributing back where agentic workloads introduce new requirements.

Success measures

3 months: Ship first flagship agentic workflow to production with defined KPI, runbook/on-call ownership, and baseline telemetry (success rate, latency, cost).
6 months: Ship additional workflows or expansions and implement lightweight ADLC: evals + guardrails + monitored rollouts + rollback.
12 months: Prove repeatable capability: 2+ product teams shipping on shared foundations, faster time-to-prod for new agents, and reliability/cost targets consistently met.

🎯 Requirements

Must-Haves:

Architectural depth: Proven ability to design and evolve complex, stateful distributed systems spanning APIs, event-driven architectures, data systems, and agentic applications - where domain logic is the primary source of complexity. Proven patterns for high-throughput performance and scaling architecture to support hundreds of thousands of customers, while preventing domain drift.
Proven experience building AI agents in production, not just demos, with a clear understanding of current best practices (agent architectures, tool calling, RAG where appropriate, prompt and context engineering). Ability to run AI/agentic systems reliably in production with observability, incident readiness, and cost controls.
Deep experience with:
- AWS serverless architecture (Lambda, API Gateway, EventBridge, Step Functions)
- Event-driven systems and asynchronous workflows
Strong coding skills: deep hands-on experience with a variety of coding languages, and comfortable with a tech-agnostic approach. Familiarity with Python is a must-have.
Agent quality discipline: hands-on experience with evaluations (offline and online), regression testing, safety guardrails, and monitoring for reliability, cost, and drift.
Strong backend and distributed-systems fundamentals: APIs, asynchronous workflows, state management, idempotency,…