AI Research Computing Infrastructure Engineer
Listed on 2026-06-13
IT/Tech
Systems Engineer, Cloud Computing, IT Infrastructure
Overview
Research Computing Infrastructure Engineer at Frederick National Laboratory (Leidos Biomedical Research, Inc.). The Frederick National Laboratory addresses critical biomedical research in cancer, AIDS, drug development, nanotechnology, and infectious disease response.
Accountability, Compassion, Collaboration, Dedication, Integrity and Versatility; it's the FNL way.
Program Description
The mission of Enterprise Information Technology (EIT) is to develop an enterprise‑level, consolidated information technology infrastructure that provides exceptional IT capabilities to the Frederick National Laboratory for Cancer Research (NCI-Frederick/FNLCR) in support of basic, translational, and clinical cancer and AIDS research. The IT Operations Group (ITOG) within Leidos Biomedical Research, Inc. manages computational servers, storage, virtual machine infrastructure, and the FNLCR network. It focuses on implementing enterprise IT best practices across computational services, storage, backup, archiving, batch and application support, server consolidation and virtualization, network infrastructure, communication technologies, and improved infrastructure for collocation of dedicated servers.
Key Roles / Responsibilities
The Research Computing Infrastructure Engineer designs, builds, and operates next‑generation high‑performance computing (HPC) environments that support container‑based workflows and GPU‑accelerated research computing.
Responsibilities include:
- Design and implement next‑generation HPC environments that leverage container‑driven workflows for GPU‑accelerated research.
- Build and maintain container orchestration systems for batch and distributed workloads.
- Integrate containerized job workflows with existing HPC schedulers and storage systems.
- Develop and maintain job templates for batch GPU training and multi‑node distributed computing.
- Automate deployment, configuration, and scaling through infrastructure‑as‑code and CI/CD practices.
- Monitor, benchmark, and optimize system performance, reliability, and resource utilization.
- Collaborate with researchers to containerize and optimize legacy workflows for scalable execution.
- Lead evaluation of emerging tools (e.g., Prefect, Ray, Airflow, Dagster) for workflow orchestration and distributed computing.
- Contribute to development of tools and bridges between orchestration frameworks and traditional HPC environments.
Qualifications:
- Possession of a Bachelor’s degree from an accredited college/university or four (4) years of relevant experience.
- Minimum of eight (8) years of related experience.
- Strong Linux systems engineering and administration experience.
- Hands‑on experience with container orchestration tools such as Kubernetes, Nomad, Run:AI, etc.
- Hands‑on scripting/programming skills (Python, Bash, or Go) for automation, monitoring, and job orchestration.
- Experience with infrastructure‑as‑code / automation tooling (Terraform, Ansible, Packer, or equivalent).
- Familiarity with system performance analysis, monitoring, and tuning.
- Comfortable with small‑team environments and taking end‑to‑end ownership of compute infrastructure.
- Ability to obtain and maintain a security clearance.
- Experience with multi‑node distributed ML frameworks (PyTorch DDP, Ray, Horovod, TensorFlow, etc.).
- Familiarity with pipeline orchestration tools (Prefect, Airflow, Dagster, Kubeflow).
- Understanding of resource management and scheduling concepts (queues, allocations, GPU device plugins, gang scheduling, multi‑node coordination).
- Understanding of storage integration with high‑performance clusters (POSIX + object storage, VAST or similar).
- Familiarity with cloud GPU environments (AWS, GCP, Azure) and hybrid workflows.
- Good communication and documentation skills, including the ability to make complex infrastructure understandable to researchers and other engineers.
- Expertise in Kubernetes, Nomad, or equivalent container orchestration systems for large‑scale computing.
- Deep knowledge of Linux systems administration, performance tuning, and automation.
- Ability to translate research computing needs into scalable, reliable infrastructure designs.
- Commitment…