AI and Systems Software Intern, At Scale AI - Fall
Listed on 2026-06-03
Software Development
Overview
NVIDIA is looking for an intern for an exciting role in AI and Systems Software for datacenter applications. You will be deeply involved in system-level debugging, analyzing large-scale infrastructure reliability, and correlating complex failure modes to underlying hardware or system issues. We work with the latest Accelerated Computing and Deep Learning software and hardware platforms, and with many scientific researchers, developers, and customers, to craft improved workflows and develop new, differentiated solutions.
Our team works with OS, container, GPU compute, and systems specialists to architect, develop, and bring up large-scale software components and optimize their performance.
Responsibilities
- Investigate and triage failures within large-scale compute clusters, performing deep-dive analysis to distinguish between software glitches, configuration errors, and hardware faults.
- Analyze logs and telemetry to correlate specific job failures to system-level issues and diagnostic test failures, helping to reduce noise and identify root causes.
- Assist with tracking, calculating, and reporting key reliability metrics, specifically Mean Time Between Failures (MTBF) and Mean Time Between Interruptions (MTBI), to drive infrastructure improvements.
- Assist in analyzing large-scale workload issues, searching for application and infrastructure improvement opportunities to ensure jobs run as fast and reliably as possible.
- Work closely with a mentor to learn about hardware validation suite architecture, document debugging methodologies, and help the team make intelligent, data-backed engineering decisions.
Qualifications
- Pursuing a BS, MS, or PhD in Computer Science, Computer Engineering, Electrical Engineering, or a related field.
- Proficiency in Python and Bash/Shell scripting for automation and tool development.
- Proven debugging skills with an ability to isolate issues in complex, distributed systems.
- Exposure to high-performance computing (HPC) environments, cluster managers (e.g., Slurm, Kubernetes), or large-scale distributed systems.
- Familiarity with server architecture (PCIe, NVLink, CPU/GPU interactions) and hardware diagnostics.
- Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
- Familiarity with system profiling and debugging tools (e.g., strace, gdb, perf).
- Experience running and analyzing standard industry benchmarks on Linux systems.
- Desire to learn and be part of a committed and hardworking team with excellent collaboration and communication skills.
- Ability to multitask effectively in a dynamic, high-performance environment.
NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you!
Our internship hourly rates are standardized based on the position, your location, year in school, degree, and experience. The hourly rate for our interns is 20 USD - 71 USD.
You will also be eligible for Intern benefits.
Applications for this job will be accepted at least until May 31, 2026.
This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
JR2018652