Senior HPC Cluster Systems Administrator
Listed on 2025-12-02
IT/Tech
Cloud Computing, Systems Engineer
Berkeley Lab’s (LBNL) Information Technology Division (IT) has an opening for a Senior HPC Cluster Systems Administrator to join its Science IT Team!
In this exciting role, you will support the Berkeley Lab research community by building, integrating, and maintaining Linux-based resources, high-performance computing cluster systems, and Kubernetes clusters. This role provides extensive expertise in High Performance Computing infrastructure and delivers advanced Linux solutions to further scientific endeavors at Berkeley Lab. The mission of Scientific Computing under Science IT is to facilitate groundbreaking fundamental research globally by providing essential computing tools, networks, and expertise to enable pioneering science.
This position has an anticipated start date of January 5, 2026.
We’re here for the same mission, to bring science solutions to the world. Join our team and YOU will play a supporting role in our goal to address global challenges! Have a high level of impact and work for an organization associated with 17 Nobel Prizes!
We invest in our employees by offering a total rewards package you can count on:
- Exceptional health and retirement benefits, including pension or 401K-style plans
- A culture where you’ll belong - we are invested in our teams!
- In addition to accruing vacation and sick time, we also have an annual Winter Holiday Shutdown
- Parental bonding leave (for both mothers and fathers)
Responsibilities:
- Perform Linux system and HPC cluster maintenance and installations, operating system upgrades, system security hardening and intrusion detection, storage and file system management, system hardware, customization of user group working environment, troubleshooting, network monitoring, and crash recovery.
- Design, deploy, and manage scalable applications using Kubernetes, ensuring the availability, performance, and readiness of the Kubernetes infrastructure.
- Automate deployment, scaling, and management of containerized applications, and collaborate with DevOps and development teams to streamline CI/CD pipelines.
- Design, deploy, and manage the global storage platform to ensure high performance, massive scalability, reliability, and future-proof solutions.
- Support storage technologies such as Lustre and VAST, as well as networks.
- Resolve I/O issues related to business applications, including diagnosing and resolving complex storage, Linux, and networking challenges in a fast-paced environment.
- Research new storage management technologies and techniques, and provide recommendations.
- Participate in developing system administration, security, and network policies, documentation, and tools oriented towards efficient systems management.
- Participate in cluster support for staff and researchers, including initial installation, integration, and ongoing maintenance of Linux High-Performance Computing cluster systems. This includes travel to remote sites as needed.
- Co-lead technical efforts with other senior system administrators in areas of HPC technologies such as job schedulers, high-performance interconnects, parallel file systems, cybersecurity, cluster management, container orchestration, VM infrastructure, networking, performance tuning, or data center planning.
- Co-lead group projects of small to medium size and complexity to implement and deploy new computing technologies and associated services for the research community.
Qualifications:
- A Bachelor’s Degree (or equivalent knowledge/training) in Computer Science, Engineering, or a related discipline, and a minimum of 12 years of relevant experience in Linux system administration within a large distributed computing environment, including experience providing systems and end-user support for multiple scientific or computational research groups, or an equivalent combination of education and experience.
- Demonstrated ability to manage large-scale, performance-critical environments, including capacity planning, scaling, and optimization.
- Significant experience deploying, scaling, and managing Kubernetes clusters, with a strong understanding of its architecture (pods, deployments, services, ingress) and container orchestration. Proven…