AI Research Computing Infrastructure Engineer

Job in Frederick, Frederick County, Maryland, 21701, USA
Listing for: BioSpace
Full Time position
Listed on 2026-06-13
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, IT Infrastructure
Salary/Wage Range or Industry Benchmark: 80,000 - 100,000 USD yearly

Overview

Research Computing Infrastructure Engineer at Frederick National Laboratory (Leidos Biomedical Research, Inc.). The Frederick National Laboratory addresses critical biomedical research in cancer, AIDS, drug development, nanotechnology, and infectious disease response.

Accountability, Compassion, Collaboration, Dedication, Integrity and Versatility; it's the FNL way.

Program Description

The mission of Enterprise Information Technology (EIT) is to develop an enterprise‑level, consolidated information technology infrastructure that provides exceptional IT capabilities to the Frederick National Laboratory for Cancer Research (NCI-Frederick/FNLCR) in support of basic, translational, and clinical cancer and AIDS research. The IT Operations Group (ITOG) within Leidos Biomedical Research, Inc. manages computational servers, storage, virtual machine infrastructure, and the FNLCR network. It focuses on implementing enterprise IT best practices across computational services, storage, backup, archiving, batch and application support, server consolidation and virtualization, network infrastructure, communication technologies, and improved infrastructure for colocation of dedicated servers.

Key Roles / Responsibilities

The Research Computing Infrastructure Engineer designs, builds, and operates next‑generation high‑performance computing (HPC) environments that support container‑based workflows and GPU‑accelerated research computing.

Responsibilities include:

  • Design and implement next‑generation HPC environments that leverage container‑driven workflows for GPU‑accelerated research.
  • Build and maintain container orchestration systems for batch and distributed workloads.
  • Integrate containerized job workflows with existing HPC schedulers and storage systems.
  • Develop and maintain job templates for batch GPU training and multi‑node distributed computing.
  • Automate deployment, configuration, and scaling through infrastructure‑as‑code and CI/CD practices.
  • Monitor, benchmark, and optimize system performance, reliability, and resource utilization.
  • Collaborate with researchers to containerize and optimize legacy workflows for scalable execution.
  • Lead evaluation of emerging tools (e.g., Prefect, Ray, Airflow, Dagster) for workflow orchestration and distributed computing.
  • Contribute to development of tools and bridges between orchestration frameworks and traditional HPC environments.
Basic Qualifications
  • Possession of a Bachelor’s degree from an accredited college/university or four (4) years of relevant experience.
  • Minimum of eight (8) years of related experience.
  • Strong Linux systems engineering and administration experience.
  • Hands‑on experience with container orchestration tools such as Kubernetes, Nomad, Run:AI, etc.
  • Hands‑on scripting/programming experience (Python, Bash, or Go) for automation, monitoring, and job orchestration.
  • Experience with infrastructure‑as‑code / automation tooling (Terraform, Ansible, Packer, or equivalent).
  • Familiarity with system performance analysis, monitoring, and tuning.
  • Comfortable with small‑team environments and taking end‑to‑end ownership of compute infrastructure.
  • Ability to obtain and maintain a security clearance.
Preferred Qualifications
  • Experience with multi‑node distributed ML frameworks (PyTorch DDP, Ray, Horovod, TensorFlow, etc.).
  • Familiarity with pipeline orchestration tools (Prefect, Airflow, Dagster, Kubeflow).
  • Understanding of resource management and scheduling concepts (queues, allocations, GPU device plugins, gang scheduling, multi‑node coordination).
  • Understanding of storage integration with high‑performance clusters (POSIX + object storage, VAST or similar).
  • Familiarity with cloud GPU environments (AWS, GCP, Azure) and hybrid workflows.
  • Good communication and documentation skills, including the ability to make complex infrastructure understandable to researchers and other engineers.
Expected Competencies
  • Expertise in Kubernetes, Nomad, or equivalent container orchestration systems for large‑scale computing.
  • Deep knowledge of Linux systems administration, performance tuning, and automation.
  • Ability to translate research computing needs into scalable, reliable infrastructure designs.
  • Commitment…