AI Operations & Infrastructure Engineer Security Clearance Job, Fort Meade area, Maryland, USA, IT/Tech

Position: AI Operations & Infrastructure Engineer with Security Clearance
Title:

AI Operations & Infrastructure Engineer

Location:

Fort Meade, MD

Clearance:

TS/SCI with a CI Polygraph

Job Details:
* Manage and maintain AI computing platforms, including GPUs and other specialized hardware
* Install and configure GPU drivers and software
* Oversee the AI software stack and tools
* Implement and manage containerization technologies like Docker and Kubernetes
* Configure and optimize networking infrastructure for AI workloads, including InfiniBand and Ethernet
* Manage storage solutions for AI data, considering performance and capacity requirements
* Deploy and manage data processing units (DPUs) to accelerate data center workloads
* Monitor and manage AI cluster health and resource utilization
* Implement workload management and scheduling tools like Slurm and Kubernetes
* Ensure efficient power and cooling for AI infrastructure to maintain optimal operating conditions
* Configure high-performance networking solutions for AI and machine learning workloads
* Optimize network performance to ensure maximum throughput and minimal latency for AI computations
* Implement and fine-tune network protocols to enhance data transfer speeds and efficiency
* Integrate NVIDIA networking products with existing AI infrastructure, including servers, GPUs, and storage systems
* Deploy networking solutions in data centers to ensure seamless connectivity between AI components
* Diagnose and resolve networking issues impacting AI workloads to maintain optimal system performance
* Provide technical support and guidance to teams managing AI infrastructure
* Collaborate with data scientists, researchers, and IT professionals to understand networking requirements and challenges
* Lead deployment and validation of servers and systems for AI-enabled platforms
* Configure and manage network topologies, BMC, OOB, TPM, power, and cooling
* Install, upgrade, and validate GPU-based servers, BlueField DPUs, cables, and transceivers
* Perform firmware upgrades, hardware validation, and storage setup
* Configure and administer physical and logical resources, including MIG partitioning and BlueField platforms
* Install and configure operating systems, cluster software, drivers, containers (Docker), and NGC CLI
* Manage and orchestrate clusters using NVIDIA Base Command Manager, Slurm, Pyxis, Enroot, and Run:ai
* Perform stress, benchmarking, and burn-in tests using HPL, NCCL, NVIDIA NeMo, and ClusterKit
* Verify cabling, firmware/software versions, and network signal quality
* Troubleshoot and resolve hardware, software, storage, and performance faults
* Replace faulty components and optimize systems for AMD/Intel platforms
* Monitor, document, and report on cluster health, resource usage, and job performance
* Ensure secure, efficient, and scalable operation of NVIDIA AI infrastructure, including user access and workload management

Requirements:
* Active NVIDIA Professional Certification in AI Networking, AI Infrastructure, or AI Operations
* Direct, hands-on professional experience administering NVIDIA GPU and data processing unit (DPU) technologies, AI software stacks, and data center environments for high-performance AI workloads
* Comprehensive expertise in deploying and maintaining AI compute platforms, including proficiency in containerization and workload orchestration using Docker, Kubernetes, Slurm, NVIDIA Base Command Manager, and Run:ai
* Ability to configure physical and logical resources, including Multi-Instance GPU (MIG) partitioning and BlueField platforms, while overseeing critical facility elements such as power, cooling, and storage solutions
* Advanced skills in AI networking, specifically configuring and optimizing high-performance InfiniBand and Ethernet fabrics to ensure maximum throughput and minimal latency
* Current active TS/SCI clearance with a CI Polygraph

Equal Opportunity Employer/Veterans/Disabled