HPC Infrastructure DevOps Engineer II

Job in Memphis, Shelby County, Tennessee, 37544, USA
Listing for: St. Jude Children's Research Hospital, Inc.
Full Time position
Listed on 2026-05-31
Job specializations:
  • IT/Tech
    IT Support, Data Engineer, Systems Engineer, Cloud Computing
Salary/Wage Range or Industry Benchmark: 80,000 - 100,000 USD yearly
Job Description & How to Apply Below

Position Overview

St. Jude is seeking an HPC Infrastructure DevOps Engineer II to join the High‑Performance Computing Support (HPCS) team. This role is responsible for the smooth operation, automation, and continuous improvement of St. Jude’s high‑performance computing environment, with a focus on HPC operations, DevOps practices, and automation for configuration, testing, monitoring, and autonomous remediation.

  • HPC compute platforms for research and data‑intensive workloads
  • GPU‑enabled environments for AI and machine learning applications
  • High‑capacity research, compliant, and scratch storage tiers
  • Archival, backup, and disaster recovery services
  • Operational tooling for observability, governance, and process automation

Working closely with infrastructure, storage, security, and research teams, the HPC Infrastructure DevOps Engineer II will deliver reliable and scalable services for computational science, regulated workflows, and AI‑enabled research. This role is central to the HPCS service portfolio, including daily HPC client request fulfillment, performance and utilization monitoring, data management and governance, data cataloguing and archival services, and HPC process automation DevOps.

Job Responsibilities

HPC Infrastructure Operations

Support the day‑to‑day operation of St. Jude’s HPC infrastructure across compute and storage platforms. Maintain a stable, secure, and scalable environment for research computing and data‑intensive scientific workflows. Work with downstream operational teams to ensure systems are configured, validated, monitored, patched, and maintained effectively. Participate in infrastructure testing, upgrade activities, service transitions, and operational readiness efforts. Contribute to the reliability and supportability of hybrid HPC environments spanning primary and remote‑site services.

Daily HPC Client Request Fulfillment

Respond to daily user requests involving HPC access, Linux environment support, storage allocation, software availability, job troubleshooting, and data movement. Provide timely and effective support to researchers, analysts, and technical staff using HPC and AI‑enabled research resources. Resolve service incidents and user issues through structured troubleshooting and escalation as needed. Maintain service‑oriented communication with users and stakeholders to ensure a high‑quality support experience.

Performance and Utilization Monitoring

Implement and improve monitoring for compute nodes, GPU resources, scheduler activity, storage systems, backup operations, and platform health. Track usage trends, availability, capacity consumption, and operational KPIs to support efficient service delivery. Analyze utilization patterns and recommend improvements to throughput, performance tuning, scheduling efficiency, and user experience. Build and maintain dashboards, metrics collection workflows, health checks, and alerting mechanisms to support proactive operations and continuous process improvement. Support governance reporting and visibility into service consumption and infrastructure health.

Data Management and Governance

Support operational controls for research and compliant data across active storage, protected environments, backup systems, and archival tiers. Implement and maintain standards for data handling, retention, access control, traceability, and lifecycle operations. Contribute to governance tracking and reporting for HPC‑supported data services. Assist with data movement and retention workflows across high‑performance, compliant, backup, and archival storage platforms.

Data Cataloguing and Archival Services

Support data intake, metadata‑aware cataloguing, archival placement, recall, restore validation, and tier‑to‑tier data movement. Assist with workflows involving archival platforms, cold storage, backup systems, and long‑term retention services. Improve discoverability and lifecycle management of research datasets through automation and procedural standardization. Support operational validation of archival and recovery workflows for critical data services.

HPC Process Automation DevOps

Use automation tooling to handle system configuration,…
