HPC Infrastructure DevOps Engineer II
Listed on 2026-04-29
IT/Tech
IT Support, Systems Engineer, Cloud Computing, Data Security
Position Overview
St. Jude is seeking an HPC Infrastructure DevOps Engineer II to join the High-Performance Computing Support (HPCS) team. This role is responsible for the smooth operation, automation, and continuous improvement of St. Jude’s high-performance computing environment, with a focus on HPC operations, DevOps practices, and automation for configuration, testing, monitoring, and autonomous remediation. The position supports a modern research computing ecosystem spanning on-premises and remote-site infrastructure, including: HPC compute platforms for research and data-intensive workloads; GPU-enabled environments for AI and machine learning applications; high-capacity research, compliant, and scratch storage tiers; archival, backup, and disaster recovery services; and operational tooling for observability, governance, and process automation.
Job Responsibilities
HPC Infrastructure Operations:
- Support the day-to-day operation of St. Jude’s HPC infrastructure across compute and storage platforms.
- Maintain a stable, secure, and scalable environment for research computing and data-intensive scientific workflows.
- Work with downstream operational teams to ensure systems are configured, validated, monitored, patched, and maintained effectively.
- Participate in infrastructure testing, upgrade activities, service transitions, and operational readiness efforts.
- Contribute to the reliability and supportability of hybrid HPC environments spanning primary and remote-site services.
Daily HPC Client Request Fulfillment:
- Respond to daily user requests involving HPC access, Linux environment support, storage allocation, software availability, job troubleshooting, and data movement.
- Provide timely and effective support to researchers, analysts, and technical staff using HPC and AI-enabled research resources.
- Resolve service incidents and user issues through structured troubleshooting and escalation as needed.
- Maintain service-oriented communication with users and stakeholders to support a high-quality support experience.
Performance and Utilization Monitoring:
- Implement and improve monitoring for compute nodes, GPU resources, scheduler activity, storage systems, backup operations, and platform health.
- Track usage trends, availability, capacity consumption, and operational KPIs to support efficient service delivery.
- Analyze utilization patterns and recommend improvements to throughput, performance tuning, scheduling efficiency, and user experience.
- Build and maintain dashboards, metrics collection workflows, health checks, and alerting mechanisms to support proactive operations and continuous process improvement.
- Support governance reporting and visibility into service consumption and infrastructure health.
Data Management and Governance:
- Support operational controls for research and compliant data across active storage, protected environments, backup systems, and archival tiers.
- Implement and maintain standards for data handling, retention, access control, traceability, and lifecycle operations.
- Contribute to governance tracking and reporting for HPC-supported data services.
- Assist with data movement and retention workflows across high-performance, compliant, backup, and archival storage platforms.
Data Cataloguing and Archival Services:
- Support data intake, metadata-aware cataloguing, archival placement, recall, restore validation, and tier-to-tier data movement.
- Assist with workflows involving archival platforms, cold storage, backup systems, and long-term retention services.
- Improve discoverability and lifecycle management of research datasets through automation and procedural standardization.
- Support operational validation of archival and recovery workflows for critical data services.
HPC Process Automation and DevOps:
- Use automation tooling to handle system configuration, provisioning, platform maintenance, testing, and operational workflows.
- Enable DevOps lifecycle functions by supporting tooling and processes for development, testing, release, and operational support.
- Build and maintain CI/CD pipelines and repeatable infrastructure workflows to improve reliability, consistency, and deployment speed.
- Reduce manual…