HPC Engineer - Generative Biology Institute
Listed on 2026-06-02
IT/Tech
Cloud Computing, IT Support, Systems Engineer
Your Role
Working as part of a new Scientific Computing team within GBI, the HPC Engineer will help operate, improve, and scale the data and computing platform that will enable cutting‑edge research in engineering biology. This is a broad, hands‑on role at the interface of Linux systems, high‑performance computing, cloud infrastructure, Kubernetes, Slurm, storage, monitoring, and researcher support. They will help turn emerging researcher needs and operational lessons into robust platform improvements, reusable tooling, and clear runbooks.
This role is particularly suited to someone who enjoys practical systems work, learning new technologies, and collaborating closely with scientists and engineers. We do not expect candidates to have deep experience in every technology listed in this description. Instead, we are looking for a strong, scientifically minded systems engineer: someone who can troubleshoot complex environments, communicate clearly with multidisciplinary teams, learn unfamiliar tools quickly, and help build reliable, scalable services that advance GBI’s scientific mission.
Key Responsibilities
- Operate, maintain, and improve GBI’s hybrid HPC platform, including Linux‑based compute environments, Slurm/Slinky workloads, Kubernetes/OKE services, Open OnDemand, GPU and CPU partitions, and shared storage
- Help provision, configure, scale, and validate compute, storage, networking, and platform services using infrastructure as code, configuration management, and automation tools such as Terraform, Helm, and Ansible
- Monitor platform health, capacity, job scheduling, GPU utilisation, storage behaviour, and network performance; investigate issues using tools such as Prometheus and Grafana
- Support researchers in using our Scientific Computing Platform, including triaging user issues and translating common pain points into platform improvements
- Build and maintain reproducible runtime environments, container images, and workflow‑supporting services for scientific computing workloads, including bioinformatics, AI/ML, data processing, and simulation workflows
- Contribute to safe rollout and maintenance processes for Slurm images, worker node pools, scheduler configuration, container runtime changes, security updates, and monitoring improvements
- Create and maintain clear technical documentation, runbooks, validation checks, and issue/PR notes so the platform can be operated consistently and improved safely by the wider team
- Bachelor’s or Master’s degree in Computer Science, Computational Biology, Engineering, Physics, Mathematics, or a related discipline, or equivalent practical experience
- Hands‑on experience supporting or administering Linux‑based systems in an HPC, cloud, research, academic, or production environment
- Working knowledge of HPC or batch‑computing concepts, including schedulers, resource requests, queues/partitions, shared file systems, and multi‑user compute environments; Slurm experience is preferred
- Ability to troubleshoot issues across systems, networking, storage, identity, containers, schedulers, and user workloads, and to follow problems through to a reliable operational fix
- Experience with scripting, automation, and version‑controlled operational changes using tools such as Git, CI/CD, Terraform, Ansible, Helm, or similar
- Ability to work closely with multidisciplinary research teams, understand scientific computing needs, and deliver practical services that advance scientific goals
- Strong communication and documentation skills, with the ability to explain technical concepts clearly to scientists, engineers, and non‑specialist audiences
- A proactive, learning‑oriented approach suited to a new team building and improving a platform while also operating it day to day
- Experience operating Slurm clusters, Slinky/slurm‑operator, Open OnDemand, Jupyter Lab services, or other researcher‑facing HPC portals and access patterns
- Experience with Kubernetes or managed Kubernetes platforms such as OCI OKE, EKS, GKE, or AKS, including Helm, Argo CD, operators, services, storage classes, and workload…