HPC Infrastructure DevOps Engineer II
Listed on 2026-06-04
IT/Tech
Systems Engineer, IT Support
About St. Jude Children's Research Hospital
The World’s Most Dedicated Never Give Up. There’s a reason St. Jude Children’s Research Hospital consistently earns a Glassdoor Employee Choice Award and is named to its "Best Place to Work" list: at our world-class pediatric research hospital, every one of our professionals shares our commitment to make a difference in the lives of the children we serve.
There’s a unique bond when you’re part of a team that gives their all to advance the treatments and cures of pediatric catastrophic diseases. The result is a collaborative, positive environment where everyone, regardless of their role, receives the resources, support, and encouragement to advance and grow their careers and be the force behind the cures. St. Jude is where those with a passion for making a difference come to break new ground!
Located in Memphis, Tennessee, the mission of St. Jude Children’s Research Hospital is to advance cures, and means of prevention, for pediatric catastrophic diseases through research and treatment. We are leading the way the world understands, treats, and defeats childhood cancer and other life‑threatening diseases.
St. Jude is seeking an HPC Infrastructure DevOps Engineer II to join the High‑Performance Computing Support (HPCS) team. This role is responsible for the smooth operation, automation, and continuous improvement of St. Jude’s high‑performance computing environment, with a focus on HPC operations, DevOps practices, and automation for configuration, testing, monitoring, and autonomous remediation. The position supports a modern research computing ecosystem spanning on‑premises and remote‑site infrastructure, including:
- HPC compute platforms for research and data‑intensive workloads
- GPU‑enabled environments for AI and machine learning applications
- High‑capacity research, compliant, and scratch storage tiers
- Archival, backup, and disaster recovery services
- Operational tooling for observability, governance, and process automation
Working closely with infrastructure, storage, security, and research teams, the HPC Infrastructure DevOps Engineer II will deliver reliable and scalable services for computational science, regulated workflows, and AI‑enabled research. This role is central to the HPCS service portfolio, including daily HPC client request fulfillment, performance and utilization monitoring, data management and governance, data cataloguing and archival services, and HPC process automation through DevOps practices.
Job Responsibilities
HPC Infrastructure Operations
- Support the day‑to‑day operation of St. Jude’s HPC infrastructure across compute and storage platforms.
- Maintain a stable, secure, and scalable environment for research computing and data‑intensive scientific workflows.
- Work with downstream operational teams to ensure systems are configured, validated, monitored, patched, and maintained effectively.
- Participate in infrastructure testing, upgrade activities, service transitions, and operational readiness efforts.
- Contribute to the reliability and supportability of hybrid HPC environments spanning primary and remote‑site services.
- Respond to daily user requests involving HPC access, Linux environment support, storage allocation, software availability, job troubleshooting, and data movement.
- Provide timely and effective support to researchers, analysts, and technical staff using HPC and AI‑enabled research resources.
- Resolve service incidents and user issues through structured troubleshooting and escalation as needed.
- Maintain service‑oriented communication with users and stakeholders to support a high‑quality support experience.
- Implement and improve monitoring for compute nodes, GPU resources, scheduler activity, storage systems, backup operations, and platform health.
- Track usage trends, availability, capacity consumption, and operational KPIs to support efficient service delivery.
- Analyze utilization patterns and recommend improvements to throughput, performance tuning, scheduling efficiency, and user experience.
- Build and…