DevOps Engineer Job, Harrow area, England, UK, IT/Tech

Overview

Job title: DevOps Engineer

Location: London; fully in-office by default

Start date: ASAP

Reports to: CTO

Compensation: £60k - £90k + equity

Cosine at a glance:
At Cosine, we’re building autonomous AI engineers that plan, write, and ship code inside real development workflows. Cosine is designed for on-premise and virtual private cloud (VPC) deployments, including fully air-gapped environments. We build our agent tooling entirely in-house and post-train open-source models to deliver reliable, enterprise-grade coding performance in security-critical settings. In 2024, Cosine achieved a 72% score on OpenAI’s SWE-Lancer benchmark, placing us among the strongest real-world software-engineering AI systems evaluated.

YC-backed and well-funded, Cosine was founded by experienced operators focused on building dependable, production-grade AI.

This role is based in our Hoxton office, five days a week, because close collaboration, fast feedback, and shared context matter for the problems we’re solving.

The role

We’re looking for a DevOps / Senior Platform / Infrastructure Engineer to own the core infrastructure that powers Cosine’s products, from Kubernetes and deployment pipelines to networking and platform services.

You’ll design and run the “paved road” that our engineers, researchers, and customers build on: reliable Kubernetes clusters, fast and safe CI/CD, solid observability, and hardened environments for demanding enterprise and on-prem deployments. You’ll also wear a classic DevOps/SRE hat: thinking in SLOs, running incident response, and keeping our systems up even as we move quickly.

This is a high-ownership role at a fast-paced, venture-backed Silicon Valley startup. You’ll work directly with founding engineers and leadership, and your decisions will materially shape how we build and ship products.

What You’ll Do

Own core infrastructure
- Design, operate, and evolve our Kubernetes-based platform (EKS or similar), including cluster topology, node groups, autoscaling, and multi-environment isolation.
- Manage supporting cloud resources: container registries, load balancers, queues, caches, and data infra needed to run our APIs and agents.
Build the deployment & tooling layer
- Design and maintain CI/CD pipelines for image builds and infra rollouts (e.g. Pulumi/Terraform + Helm/Docker).
- Implement safe rollout strategies (blue/green, canary, staged rollouts) and fast rollback paths.
- Build internal tools and abstractions that make it easy for product teams to self-serve infra safely.
Own reliability & operations (SRE-ish)
- Define and track SLOs/SLIs for key services (latency, error rates, availability).
- Improve our observability stack (metrics, logs, traces, alerts) so issues are obvious, actionable, and debuggable.
- Participate in the on-call rotation, lead incident response when needed, and drive blameless post-mortems and fixes.
Shape networking & security
- Design and maintain networking: VPCs, subnets, ingress/egress, service meshes / L7 routing, DNS, and TLS.
- Implement least-privilege access via IAM, secure secret management, and hardened configurations for multi-tenant and isolated customer environments.
- Help design patterns for secure enterprise and on-prem / regulated deployments.
Partner with product & research
- Work closely with application, ML, and research teams to understand their needs and translate them into reusable infra building blocks.
- Provide guidance on “how to run this in production” — capacity planning, failure modes, and operational readiness reviews.

What We’re Looking For

Have strong experience
- 5+ years building and operating production infrastructure on a major cloud (AWS, GCP, or Azure).
- Significant hands-on experience running Kubernetes in production (EKS/GKE/AKS or self-managed):
  - Cluster upgrades, autoscaling, node group design, and multi-env setups.
  - Helm or similar for packaging services.

Think in infrastructure-as-code
- Deep experience with IaC tools (Pulumi, Terraform, CDK, or similar).
- Comfortable managing infra changes via code review, CI, and automated rollouts.

Care deeply about reliability
- Have owned the uptime and performance of user-facing systems.
- Comfortable participating in (and improving) on-call…