AI Infra Engineer
Job in
Morrisville, Bucks County, Pennsylvania, 19067, USA
Listed on 2026-02-20
Listing for:
CTG (Computer Task Group, Inc.)
Full Time position
Job specializations:
IT/Tech
Systems Engineer, Cloud Computing, IT Support
Job Description & How to Apply Below
Location:
Morrisville, NC
Duration:
12+ month contract with the possibility of going long term
This role combines IT operations, hardware troubleshooting, and AI infrastructure expertise. Expect to handle day-to-day system administration, diagnose and resolve issues, and ensure optimal performance for ML workloads.
Key Responsibilities:
Hardware Management and Troubleshooting:
Monitor and maintain GPU servers/workstations, including diagnosing and resolving hardware failures (e.g., GPU faults, power issues, cooling problems). Coordinate repairs, replacements, or upgrades as needed to ensure system uptime.
Software and Driver Management:
Install, update, and configure CUDA drivers, Linux operating systems (e.g., Ubuntu or CentOS), and related dependencies. Ensure compatibility across hardware and software stacks for seamless ML operations.
Performance Benchmarking:
Run and analyze MLPerf benchmarks to evaluate system performance, identify bottlenecks, and optimize configurations for ML training tasks.
System Diagnostics and Problem Resolution:
Proactively monitor systems for issues, perform root-cause analysis on failures or performance degradation, and implement fixes. This includes debugging kernel errors, network issues, or resource contention during LLM training.
General Infrastructure Ops:
Implement best practices for security, backups, logging, and monitoring. Handle routine maintenance, such as firmware updates, patch management, and capacity planning for the GPU cluster.
Minimum Requirements:
Proven experience (3+ years) managing GPU-accelerated servers or high-performance computing (HPC) environments, preferably in AI/ML contexts.
Strong knowledge of Linux system administration, including shell scripting, package management, and networking.
Hands-on experience with the NVIDIA CUDA toolkit, drivers, and GPU hardware (e.g., A100, H100, or similar).
Familiarity with ML benchmarking tools like MLPerf and frameworks such as TensorFlow, PyTorch, or Hugging Face for LLM training.
Ability to diagnose hardware and software issues using tools like nvidia-smi, dmesg, top/htop, or Prometheus/Grafana for monitoring.
Understanding of AI infrastructure ops, including containerization (Docker/Kubernetes) and orchestration for distributed training.
Excellent problem-solving skills with a proactive approach to preventing downtime.
Preferred Qualifications:
Experience with cluster management tools like Slurm, Kubernetes, or Ray for scaling ML workloads.
Knowledge of hardware diagnostics for servers (e.g., IPMI, BIOS configuration, RAID setups).
Background in IT operations with an AI focus, such as DevOps for ML (MLOps).
Certifications like RHCE (Red Hat Certified Engineer), NVIDIA certifications, or similar.
Ability to work independently in a remote or on-site setup, with strong communication skills for reporting issues.
Excellent verbal and written English communication skills and the ability to interact professionally with a diverse group are required. CTG does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services for this role.
To Apply:
To be considered, please apply directly to this requisition using the link provided. For additional information, please contact Recruiter Jamie Robinson.
About CTG
CTG, a Cegeka company, is at the forefront of digital transformation, providing IT and business solutions that accelerate project momentum and deliver desired value. Over nearly 60 years, we have earned a reputation as a faster, more reliable, results-driven partner.
Our vision is to be an indispensable partner to our clients and the preferred career destination for digital and technology experts. CTG leverages the expertise of over 9,000 team members in 19 countries to provide innovative solutions. Together, we operate across the Americas, Europe, and India, working in close cooperation with over 3,000 clients in many of today's highest-growth industries.
Our culture is a direct result of the people who work at CTG, the values we hold, and the actions…