System Infrastructure/Platform Engineer, HPC Technology Department Job, Berkeley area, California USA, IT/Tech

Position: System Infrastructure / Platform Engineer, HPC Technology Department

Overview

The National Energy Research Scientific Computing Center (NERSC) is seeking a System Infrastructure / Platform Engineer to help build and manage HPC systems and Linux-based infrastructure. NERSC operates some of the world’s largest supercomputers, supporting thousands of researchers tackling major scientific challenges.

In this role, you will manage high-performance computing environments, including HPC systems, containers, virtual machines, and core infrastructure services. You’ll work with cutting-edge technologies such as CPU/GPU clusters, parallel storage, high-speed networking, Slurm, and Kubernetes, balancing innovation with reliability, performance, and security. You will collaborate with engineers, researchers, vendors, and open-source communities to develop scalable solutions that advance scientific discovery and the future of HPC.

What You Will Do

Build and manage Linux systems and storage infrastructure
Troubleshoot complex technical issues with team members
Install, upgrade, and secure systems and services
Develop and maintain scripts and automation tools
Participate in a 24/7 on-call rotation
Lead small projects, upgrades, and service rollouts
Collaborate with vendors to improve technologies and user experience
Support reliable operations of NERSC’s Perlmutter supercomputer and Spin Kubernetes platform
Develop and integrate services across NERSC and DOE facilities, including the upcoming Doudna supercomputer
Present technical work to the HPC community at conferences and industry events

Responsibilities

In addition to Level 3 responsibilities, Level 4 adds:
Solve complex technical problems with independent judgment; develop team strategies and project plans; provide technical leadership and mentorship; lead system improvements for performance, reliability, and security; evaluate emerging HPC technologies; represent NERSC in HPC and DOE technical communities and advocacy groups.

What is Required to be hired at a Level 3

Typically, 8+ years of related experience with a Bachelor’s degree; alternatively, 6+ years with a Master’s degree; or equivalent career experience
4+ years of experience managing large-scale Linux-based system deployments in a high-performance computing, cloud computing, or hyper-scale environment
Mastery of Linux concepts and operations (processes, networking, system logs, performance)
Proficiency with bash and Python scripting
Experience with some or all of our key technologies:
- containers (such as Docker or Kubernetes)
- virtualization (such as Proxmox or VMware)
- cloud-based deployment (such as AWS, Azure or GCP)
- identity and access management
- database administration, tuning, and troubleshooting
- storage systems technologies (such as iSCSI and NAS appliances)
- parallel file systems (such as Lustre, GPFS, or VAST)
- high-speed networking/interconnect (such as InfiniBand, Slingshot, or RoCE)
- advanced performance analysis and debugging tools (such as strace, lsof, eBPF, or gdb)
- DevOps tools (such as GitLab or Jira) and processes (such as issues, merge requests, and API/automation)
Familiarity with automated provisioning systems (such as Chef, Foreman, or Terraform)
Familiarity with configuration management systems (such as Ansible or Puppet)
Working knowledge of Linux system engineering and security practices
Ability to resolve complex issues in creative and effective ways and derive technical solutions in a collaborative environment to meet end-user requirements
Demonstrated ability to work both independently and collaboratively on large projects, and to contribute to an active and respectful intellectual environment
Creative, positive, and collaborative work style
Excellent oral and written communication skills

Requirements

Additional Requirements to be hired at a Level 4:
- Typically, 12+ years of related experience with a Bachelor’s degree; alternatively, 8+ years with a Master’s degree; or equivalent career experience
- Proven ability to lead troubleshooting and resolution of high-impact incidents in complex, large-scale environments
- Demonstrated leadership in cross-team collaboration and mentoring
- Experience in software engineering, Linux systems programming, or complex scripting
- Experience…
