HPC Infrastructure DevOps Engineer II

Job in Memphis, Shelby County, Tennessee, 37544, USA
Listing for: St. Jude Children's Research Hospital
Full Time position
Listed on 2026-04-29
Job specializations:
  • IT/Tech
    IT Support, Systems Engineer, Cloud Computing, Data Security
Salary/Wage Range or Industry Benchmark: 80,000 - 100,000 USD per year

Position Overview

St. Jude is seeking an HPC Infrastructure DevOps Engineer II to join the High-Performance Computing Support (HPCS) team. This role is responsible for the smooth operation, automation, and continuous improvement of St. Jude’s high-performance computing environment, with a focus on HPC operations, DevOps practices, and automation for configuration, testing, monitoring, and autonomous remediation. The position supports a modern research computing ecosystem spanning on-premises and remote-site infrastructure. That ecosystem includes HPC compute platforms for research and data-intensive workloads; GPU-enabled environments for AI and machine learning applications; high-capacity research, compliant, and scratch storage tiers; archival, backup, and disaster recovery services; and operational tooling for observability, governance, and process automation.

Job Responsibilities
  HPC Infrastructure Operations:
  • Support the day-to-day operation of St. Jude’s HPC infrastructure across compute and storage platforms.
  • Maintain a stable, secure, and scalable environment for research computing and data-intensive scientific workflows.
  • Work with downstream operational teams to ensure systems are configured, validated, monitored, patched, and maintained effectively.
  • Participate in infrastructure testing, upgrade activities, service transitions, and operational readiness efforts.
  • Contribute to the reliability and supportability of hybrid HPC environments spanning primary and remote-site services.
  Daily HPC Client Request Fulfillment:
  • Respond to daily user requests involving HPC access, Linux environment support, storage allocation, software availability, job troubleshooting, and data movement.
  • Provide timely and effective support to researchers, analysts, and technical staff using HPC and AI-enabled research resources.
  • Resolve service incidents and user issues through structured troubleshooting and escalation as needed.
  • Maintain service-oriented communication with users and stakeholders to ensure a high-quality support experience.
  Performance and Utilization Monitoring:
  • Implement and improve monitoring for compute nodes, GPU resources, scheduler activity, storage systems, backup operations, and platform health.
  • Track usage trends, availability, capacity consumption, and operational KPIs to support efficient service delivery.
  • Analyze utilization patterns and recommend improvements to throughput, performance tuning, scheduling efficiency, and user experience.
  • Build and maintain dashboards, metrics collection workflows, health checks, and alerting mechanisms to support proactive operations and continuous process improvement.
  • Support governance reporting and visibility into service consumption and infrastructure health.
  Data Management and Governance:
  • Support operational controls for research and compliant data across active storage, protected environments, backup systems, and archival tiers.
  • Implement and maintain standards for data handling, retention, access control, traceability, and lifecycle operations.
  • Contribute to governance tracking and reporting for HPC-supported data services.
  • Assist with data movement and retention workflows across high-performance, compliant, backup, and archival storage platforms.
  Data Cataloguing and Archival Services:
  • Support data intake, metadata-aware cataloguing, archival placement, recall, restore validation, and tier-to-tier data movement.
  • Assist with workflows involving archival platforms, cold storage, backup systems, and long-term retention services.
  • Improve discoverability and lifecycle management of research datasets through automation and procedural standardization.
  • Support operational validation of archival and recovery workflows for critical data services.
  HPC Process Automation DevOps:
  • Use automation tooling to handle system configuration, provisioning, platform maintenance, testing, and operational workflows.
  • Enable DevOps lifecycle functions by supporting tooling and processes for development, testing, release, and operational support.
  • Build and maintain CI/CD pipelines and repeatable infrastructure workflows to improve reliability, consistency, and deployment speed.
  • Reduce manual…