AI Research Computing Infrastructure Engineer Job, Frederick area, Maryland, USA, IT/Tech

Overview

Research Computing Infrastructure Engineer at Frederick National Laboratory (Leidos Biomedical Research, Inc.). The Frederick National Laboratory addresses critical biomedical research in cancer, AIDS, drug development, nanotechnology, and infectious disease response.

Accountability, Compassion, Collaboration, Dedication, Integrity and Versatility; it's the FNL way.

Program Description

The mission of Enterprise Information Technology (EIT) is to develop an enterprise‑level, consolidated information technology infrastructure that provides exceptional IT capabilities to the Frederick National Laboratory for Cancer Research (NCI-Frederick/FNLCR) in support of basic, translational, and clinical cancer and AIDS research. The IT Operations Group (ITOG) within Leidos Biomedical Research, Inc. manages computational servers, storage, virtual machine infrastructure, and the FNLCR network. The group focuses on implementing enterprise IT best practices across computational services, storage, backup, archiving, batch and application support, server consolidation and virtualization, network infrastructure, communication technologies, and improved infrastructure for collocation of dedicated servers.

Key Roles / Responsibilities

The Research Computing Infrastructure Engineer designs, builds, and operates next‑generation high‑performance computing (HPC) environments that support container‑based workflows and GPU‑accelerated research computing.

Responsibilities include:

Design and implement next‑generation HPC environments that leverage container‑driven workflows for GPU‑accelerated research.
Build and maintain container orchestration systems for batch and distributed workloads.
Integrate containerized job workflows with existing HPC schedulers and storage systems.
Develop and maintain job templates for batch GPU training and multi‑node distributed computing.
Automate deployment, configuration, and scaling through infrastructure‑as‑code and CI/CD practices.
Monitor, benchmark, and optimize system performance, reliability, and resource utilization.
Collaborate with researchers to containerize and optimize legacy workflows for scalable execution.
Lead evaluation of emerging tools (e.g., Prefect, Ray, Airflow, Dagster) for workflow orchestration and distributed computing.
Contribute to development of tools and bridges between orchestration frameworks and traditional HPC environments.

Basic Qualifications

Possession of a Bachelor’s degree from an accredited college/university or four (4) years of relevant experience.
Minimum of eight (8) years of related experience.
Strong Linux systems engineering and administration experience.
Hands‑on experience with container orchestration tools such as Kubernetes, Nomad, Run:AI, etc.
Hands‑on scripting/programming skills (Python, Bash, or Go) for automation, monitoring, and job orchestration.
Experience with infrastructure‑as‑code / automation tooling (Terraform, Ansible, Packer, or equivalent).
Familiarity with system performance analysis, monitoring, and tuning.
Comfortable with small‑team environments and taking end‑to‑end ownership of compute infrastructure.
Ability to obtain and maintain a security clearance.

Preferred Qualifications

Experience with multi‑node distributed ML frameworks (PyTorch DDP, Ray, Horovod, TensorFlow, etc.).
Familiarity with pipeline orchestration tools (Prefect, Airflow, Dagster, Kubeflow).
Understanding of resource management and scheduling concepts (queues, allocations, GPU device plugins, gang scheduling, multi‑node coordination).
Understanding of storage integration with high‑performance clusters (POSIX + object storage, VAST or similar).
Familiarity with cloud GPU environments (AWS, GCP, Azure) and hybrid workflows.
Good communication and documentation skills, including the ability to make complex infrastructure understandable to researchers and other engineers.

Expected Competencies

Expertise in Kubernetes, Nomad, or equivalent container orchestration systems for large‑scale computing.
Deep knowledge of Linux systems administration, performance tuning, and automation.
Ability to translate research computing needs into scalable, reliable infrastructure designs.
Commitment…