HPC Sr. Scientific Software Engineer; IT@JH Research Computing
Listed on 2025-12-02
IT/Tech
Systems Engineer, AI Engineer, Cloud Computing, IT Support
IT@JH Research Computing is seeking an HPC Sr. Scientific Software Engineer who will design, build, and support Johns Hopkins University’s high-performance computing and AI research infrastructure. This role integrates elements of both systems and software engineering, ensuring scalable, secure, and reproducible environments for scientific and data-intensive research. The Engineer develops and automates system and application workflows across CPU/GPU clusters, parallel storage, and hybrid cloud platforms.
Responsibilities include configuring and optimizing large-scale Linux environments, implementing job scheduling and orchestration frameworks, containerizing applications, and supporting researchers in optimizing performance and reproducibility. Work combines project-based engineering with operational support, requiring both independent problem-solving and close collaboration with the Research Computing team and faculty stakeholders.
- Develop and refine deployment strategies for scientific software on HPC and AI systems.
- Design computational workflows, select optimal software configurations, and use tools such as Ansible for automation.
- Assist teams in implementing, tuning, and optimizing AI models and gateway applications (e.g., XDMoD, ColdFront, Open OnDemand, CryoSPARC Live, SBGrid, AI agents).
- Analyze and optimize the performance of AI models and HPC applications, focusing on GPU-enabled computing.
- Implement parallel processing, distributed computing, and resource management techniques for efficient job execution.
- Develop, debug, and maintain software tools, libraries, and frameworks supporting HPC and AI workloads.
- Collaborate with the systems team and software vendors (e.g., NVIDIA, Intel, MATLAB) to optimize systems for maximum performance.
- Utilize CUDA, DNN, TensorRT, and Intel compilers to enhance system performance.
- Manage and support scientific software deployment across HPC, cloud-based, and colocation facilities.
- Oversee installation, configuration, and maintenance of HPC packages with tools like CMake, Make, EasyBuild, Spack, and Lua module files.
- Work closely with cross-functional teams, including researchers, data scientists, and software developers, to address complex HPC/AI challenges.
- Mentor junior engineers and foster a culture of continuous learning.
- Resolve complex technical issues and perform root cause analysis for HPC/AI software challenges.
- Implement effective solutions to prevent recurrence and improve system reliability.
- Provide training workshops for researchers and students, focusing on troubleshooting, optimizing workflows, and effectively using HPC systems.
- Stay current with advances in HPC and AI technologies and methodologies.
- Incorporate new research findings into existing systems to improve performance and capabilities.
- Develop and manage container orchestration strategies to ensure scalability, reliability, and security of applications.
- Oversee the container lifecycle from creation and deployment to scaling and removal.
- Create comprehensive documentation for system designs, performance metrics, and project status.
- Ensure compliance with security and regulatory standards for all HPC and AI systems.
- Design, deploy, and maintain large-scale Linux HPC clusters with CPU/GPU resources, high-speed networks, and distributed storage.
- Develop and maintain automation frameworks for provisioning, monitoring, and software lifecycle management.
- Implement and optimize job scheduling, container orchestration, and workflow automation tools to support diverse research workloads.
- Collaborate with faculty and research teams to parallelize, containerize, and scale computational workflows for multi-GPU and distributed environments.
- Benchmark and tune application performance across architectures, documenting findings and sharing best practices.
- Integrat…