HPC Engineer - Generative Biology Institute (Oxford area, England, UK)

Your Role

Working as part of a new Scientific Computing team within GBI, the HPC Engineer will help operate, improve, and scale the data and computing platform that will enable cutting‑edge research in engineering biology. This is a broad, hands‑on role at the interface of Linux systems, high‑performance computing, cloud infrastructure, Kubernetes, Slurm, storage, monitoring, and researcher support. They will help turn emerging researcher needs and operational lessons into robust platform improvements, reusable tooling, and clear runbooks.

This role is particularly suited to someone who enjoys practical systems work, learning new technologies, and collaborating closely with scientists and engineers. We do not expect candidates to have deep experience in every technology listed in this description. Instead, we are looking for a strong, scientifically minded systems engineer: someone who can troubleshoot complex environments, communicate clearly with multidisciplinary teams, learn unfamiliar tools quickly, and help build reliable, scalable services that advance GBI’s scientific mission.

Key Responsibilities

Operate, maintain, and improve GBI’s hybrid HPC platform, including Linux‑based compute environments, Slurm/Slinky workloads, Kubernetes/OKE services, Open OnDemand, GPU and CPU partitions, and shared storage
Help provision, configure, scale, and validate compute, storage, networking, and platform services using infrastructure as code, configuration management, and automation tools such as Terraform, Helm, and Ansible
Monitor platform health, capacity, job scheduling, GPU utilisation, storage behaviour, and network performance; investigate issues using tools such as Prometheus and Grafana
Support researchers in using our Scientific Computing Platform, including triaging user issues and translating common pain points into platform improvements
Build and maintain reproducible runtime environments, container images, and workflow‑supporting services for scientific computing workloads, including bioinformatics, AI/ML, data processing, and simulation workflows
Contribute to safe rollout and maintenance processes for Slurm images, worker node pools, scheduler configuration, container runtime changes, security updates, and monitoring improvements
Create and maintain clear technical documentation, runbooks, validation checks, and issue/PR notes so the platform can be operated consistently and improved safely by the wider team

Requirements

Essential Knowledge, Skills and Experience

Bachelor’s or Master’s degree in Computer Science, Computational Biology, Engineering, Physics, Mathematics, or a related discipline, or equivalent practical experience
Hands‑on experience supporting or administering Linux‑based systems in an HPC, cloud, research, academic, or production environment
Working knowledge of HPC or batch‑computing concepts, including schedulers, resource requests, queues/partitions, shared file systems, and multi‑user compute environments; Slurm experience is preferred
Ability to troubleshoot issues across systems, networking, storage, identity, containers, schedulers, and user workloads, and to follow problems through to a reliable operational fix
Experience with scripting, automation, and version‑controlled operational changes using tools such as Git, CI/CD, Terraform, Ansible, Helm, or similar
Ability to work closely with multidisciplinary research teams, understand scientific computing needs, and deliver practical services that advance scientific goals
Strong communication and documentation skills, with the ability to explain technical concepts clearly to scientists, engineers, and non‑specialist audiences
A proactive, learning‑oriented approach suited to a new team building and improving a platform while also operating it day to day

Desirable Knowledge, Skills and Experience

Experience operating Slurm clusters, Slinky/slurm‑operator, Open OnDemand, Jupyter Lab services, or other researcher‑facing HPC portals and access patterns
Experience with Kubernetes or managed Kubernetes platforms such as OCI OKE, EKS, GKE, or AKS, including Helm, Argo CD, operators, services, storage classes, and workload…