AI Infra Engineer
Listed on 2026-02-21
IT/Tech
Systems Engineer, AI Engineer, Cloud Computing
CTG is seeking to fill an AI Infra Engineer opening for our client in Morrisville, NC.
Location: Morrisville, NC
Duration: 12+ month contract with potential to go long term
This role combines IT operations, hardware troubleshooting, and AI infrastructure expertise. Expect to handle day‑to‑day system administration, diagnose and resolve issues, and ensure optimal performance for ML workloads.
Key Responsibilities
- Hardware Management and Troubleshooting: Monitor and maintain GPU servers/workstations, including diagnosing and resolving hardware failures (e.g., GPU faults, power issues, cooling problems). Coordinate repairs, replacements, or upgrades as needed to ensure system uptime.
- Install, update, and configure CUDA drivers, Linux operating systems (e.g., Ubuntu or CentOS), and related dependencies. Ensure compatibility across hardware and software stacks for seamless ML operations.
- Run and analyze MLPerf benchmarks to evaluate system performance, identify bottlenecks, and optimize configurations for ML training tasks.
- Proactively monitor systems for issues, perform root‑cause analysis on failures or performance degradation, and implement fixes. This includes debugging kernel errors, network issues, or resource contention during LLM training.
- Implement best practices for security, backups, logging, and monitoring. Handle routine maintenance, such as firmware updates, patch management, and capacity planning for the GPU cluster.
Qualifications
- Proven experience (3+ years) in managing GPU‑accelerated servers or high‑performance computing (HPC) environments, preferably in AI/ML contexts.
- Strong knowledge of Linux system administration, including shell scripting, package management, and networking.
- Hands‑on experience with NVIDIA CUDA toolkit, drivers, and GPU hardware (e.g., A100, H100, or similar).
- Familiarity with ML benchmarking tools like MLPerf and frameworks such as TensorFlow, PyTorch, or Hugging Face for LLM training.
- Ability to diagnose hardware and software issues using tools like nvidia‑smi, dmesg, top/htop, or Prometheus/Grafana for monitoring.
- Understanding of AI infrastructure ops, including containerization (Docker/Kubernetes) and orchestration for distributed training.
- Excellent problem‑solving skills with a proactive approach to preventing downtime.
- Experience with cluster management tools like Slurm, Kubernetes, or Ray for scaling ML workloads.
- Knowledge of hardware diagnostics for servers (e.g., IPMI, BIOS configuration, RAID setups).
- Background in IT operations with an AI focus, such as DevOps for ML (MLOps).
- Certifications like RHCE (Red Hat Certified Engineer), NVIDIA certifications, or similar.
- Ability to work independently in a remote or on‑site setup, with strong communication skills for reporting issues.
Excellent verbal and written English communication skills and the ability to interact professionally with a diverse group are required.
CTG does not accept unsolicited resumes from headhunters, recruitment agencies, or fee‑based recruitment services for this role.
To Apply
Please apply directly to this requisition using the link provided. For additional information, please contact Recruiter Jamie Robinson at
Equal Employment Opportunity Statement
CTG will consider for employment all qualified applicants including those with criminal histories in a manner consistent with the requirements of all applicable local, state, and federal laws.
CTG is an Equal Opportunity Employer. CTG will assure equal opportunity and consideration to all applicants and employees in recruitment, selection, placement, training, benefits, compensation, promotion, transfer, and release of individuals without regard to race, creed, religion, color, national origin, sex, sexual orientation, gender identity and gender expression, age, disability, marital or veteran status, citizenship status, or any other discriminatory factors as required by law.
CTG is fully committed to promoting employment opportunities for members of protected classes.