Senior Linux HPC Storage Engineer
Listed on 2025-10-18
IT/Tech
Systems Engineer, Data Engineer, Cloud Computing
We are hiring a Senior Linux HPC Storage Engineer to design, operate, and maintain cluster, server, and workstation storage supporting the services where science happens at ORNL! This position resides in the Emerging Technologies & Computing team in the Research Computing group in the Information Technology Services Directorate at Oak Ridge National Laboratory (ORNL).
The Emerging Technology Computational Group facilitates ORNL goals through HPC systems engineering, integration, and support for the research community. By providing design, deployment, optimization, monitoring, and tooling support across multiple clustered infrastructures, we facilitate Lab‑wide R&D projects. Our HPC clusters range in scope from just a handful of nodes to over fifty thousand cores.
We partner with ORNL research organizations to enable research excellence and delivery. We work with other clustered computing and HPC groups to help research programs identify the best solutions for their needs. When we build our customers' environments, our team collaborates to design, implement, and maintain the systems from inception to retirement.
Major Duties/Responsibilities
- Architect, deploy, and manage large‑scale HPC storage systems, including parallel file systems such as Lustre, GPFS/Spectrum Scale, BeeGFS, and WEKA.
- Design, implement, and operate large‑scale Ceph storage clusters for HPC and research workloads, delivering reliable, high‑performance object, block, and file storage services.
- Ensure the availability, performance, scalability, and security of production storage environments.
- Administer and optimize enterprise storage platforms such as Qumulo and NetApp in support of HPC and research workloads.
- Design, deploy, and maintain archival storage solutions, including Spectra Logic BlackPearl and large‑scale tape libraries, to ensure long‑term data preservation and accessibility.
- Integrate high‑performance, enterprise, and archival storage layers into cohesive tiered storage architectures that balance cost, scalability, and performance for diverse scientific workflows.
- Leverage automation and monitoring solutions to minimize day‑to‑day maintenance while identifying opportunities to optimize system performance and management.
- Collaborate with researchers and technical POCs to support large data workflows and optimize I/O performance for scientific workloads.
- Automate storage provisioning, monitoring, and maintenance using scripting and configuration management tools.
- Diagnose and resolve complex storage and I/O‑related issues in high‑throughput, low‑latency HPC environments.
- Evaluate emerging storage technologies (NVMe, object storage, hierarchical storage management, burst buffers) and contribute to strategic planning for future HPC systems.
- Work with 24/7 operations staff to streamline monitoring and troubleshooting, significantly reducing the need for off‑hours support.
- Deliver ORNL’s mission by aligning behaviors, priorities, and interactions with our core values of Impact, Integrity, Teamwork, Safety, and Service. Promote equal opportunity by fostering a respectful workplace – in how we treat one another, work together, and measure success.
Qualifications
- A BS degree in computer science, computer engineering, information technology, information systems, science, engineering, business, or a related discipline and a minimum of eight (8) to twelve (12) years of aligned professional experience are required for consideration. An overall combination of equivalent education and experience may be considered.
- Master's and PhD degree holders in the same fields of study are also encouraged to apply. Master's degree holders should have a minimum of seven (7) to ten (10) years of relevant and aligned experience. PhD holders should have a minimum of four (4) to six (6) years of relevant and aligned experience.
- Five (5) or more years managing UNIX/Linux systems.
- Demonstrated experience managing HPC storage and large‑scale enterprise storage systems.
- Three (3) or more years working with configuration management and automation tools such as Git, Jenkins, Ansible, or Puppet.
- Proficiency with at least one scripting language (Bash, Python, Perl, etc.).
- Strong…