System Infrastructure/Platform Engineer, HPC Technology Department
Listed on 2026-06-27
System Infrastructure / Platform Engineer
The National Energy Research Scientific Computing Center (NERSC) is seeking a System Infrastructure / Platform Engineer to help build and manage HPC systems and Linux-based infrastructure. NERSC operates some of the world's largest supercomputers, supporting thousands of researchers tackling major scientific challenges.
In this role, you will manage high-performance computing environments, including HPC systems, containers, virtual machines, and core infrastructure services. You'll work with cutting-edge technologies such as CPU/GPU clusters, parallel storage, high-speed networking, Slurm, and Kubernetes, balancing innovation with reliability, performance, and security at scale.
Collaborating with engineers, researchers, vendors, and open-source communities, you will help develop scalable solutions that advance scientific discovery and the future of HPC. If you have Linux experience, an interest in science, and enjoy fast-paced collaborative environments, NERSC would love to hear from you.
We're here for the same mission: to bring science solutions to the world. Join our team and you will play a supporting role in our goal to address global challenges! Have a high level of impact and work for an organization associated with 17 Nobel Prizes!
Why join Berkeley Lab?
We invest in our employees by offering a total rewards package you can count on:
- Exceptional health and retirement benefits, including pension or 401(k)-style plans
- Opportunities to grow in your career - check out our Tuition Assistance Program
- A culture where you'll belong - we are invested in our teams!
- In addition to accruing vacation and sick time, we also have a Winter Holiday Shutdown every year.
- Parental bonding leave (for both mothers and fathers)
- Pet insurance
What You Will Do if hired at a Level 3:
- Build and manage Linux systems and storage infrastructure
- Troubleshoot complex technical issues with team members
- Install, upgrade, and secure systems and services
- Develop and maintain scripts and automation tools
- Participate in a 24/7 on-call rotation
- Lead small projects, upgrades, and service rollouts
- Collaborate with vendors to improve technologies and user experience
- Support reliable operations of NERSC's Perlmutter supercomputer and Spin Kubernetes platform
- Develop and integrate services across NERSC and DOE facilities, including the upcoming Doudna supercomputer
- Present technical work to the HPC community at conferences and industry events
Additional Responsibilities if hired at a Level 4:
- Solve complex technical problems with independent judgment
- Develop team strategies and project plans
- Provide technical leadership and mentorship
- Lead system improvements for performance, reliability, and security
- Evaluate emerging HPC technologies and capabilities
- Represent NERSC in HPC and DOE technical communities and advocacy groups
What is Required to be hired at a Level 3:
- Typically, 8+ years of related experience with a Bachelor's degree; alternatively, 6+ years with a Master's degree; or equivalent career experience
- 4+ years of experience managing large-scale Linux-based system deployments in a high-performance computing, cloud computing, or hyper-scale environment
- Mastery of Linux concepts and operations (processes, networking, system logs, performance)
- Proficiency with bash and Python scripting
- Experience with some or all of our key technologies:
  - containers (such as Docker or Kubernetes)
  - virtualization (such as Proxmox or VMware)
  - cloud-based deployment (such as AWS, Azure, or GCP)
  - identity and access management
  - database administration, tuning, and troubleshooting
  - storage systems technologies (such as iSCSI and NAS appliances)
  - parallel file systems (such as Lustre, GPFS, or VAST)
  - high-speed networking/interconnect (such as InfiniBand, Slingshot, or RoCE)
  - advanced performance analysis and debugging tools (such as strace, lsof, eBPF, or gdb)
  - DevOps tools (such as GitLab or Jira) and processes (such as issues, merge requests, and API/automation)
- Familiarity with automated provisioning systems (such as Chef, Foreman, or Terraform)
- Familiarity with configuration management systems (such as Ansible or Puppet)
- Working knowledge of Linux…