Senior ML Accelerator Engineer - GPU
Sunnyvale area, California, USA, Software Development

GM’s vision of Zero Crashes, Zero Emissions, and Zero Congestion guides everything we do in autonomous and assisted driving. The AV organization is building advanced automated driving technologies, including Level 4–capable fully self-driving systems, to move us toward safer, more sustainable, and more accessible mobility.

For the AI Kernels & Compilers team, that mission shows up in the details: turning cutting‑edge perception, prediction, and planning research into production‑grade software that can run efficiently and reliably on real vehicles. We pioneer new approaches to model export, kernel development, and performance engineering so that every cycle on our accelerators translates into better situational awareness, faster reaction times, and more robust behavior on the road.

If you want your compiler and kernel work to directly influence how automated vehicles understand and react to the world, while operating at the safety, reliability, and scale of a company like GM, this is where that impact becomes real.

About the Team

The AI Kernels team builds high‑performance GPU kernels and custom libraries that sit at the heart of our on‑vehicle ML inference for ADAS and autonomous driving. We own making core AI workloads faster, more reliable, and easier to maintain and deploy on real cars, under real‑world constraints.

That means:

Designing and implementing custom operators when vendor libraries hit their limits

Integrating those kernels deep into our ML runtime stack

Debugging and tuning GPU performance across the AV software stack, often on hardware‑in‑the‑loop

We partner closely with AI Solutions, AI Compilers, AI Architecture, and AI Tooling to ensure models deploy efficiently to the car while consistently meeting strict latency, throughput, and reliability targets. If you enjoy pushing GPUs to their limits and seeing your work directly impact how autonomous vehicles perceive and act in the world, this is the team for you.

What you’ll be doing (Responsibilities)

Design, implement, benchmark, and iterate on CUDA-based kernels and custom operators to squeeze every last drop of performance out of on-vehicle inference workloads.

Build and improve tooling and infrastructure that make it easier to profile, debug, and validate CUDA kernels and accelerator-backend code across the AV stack.

Partner with AI Solutions, Compilers, and Architecture to translate model and system requirements into concrete kernel roadmaps, priorities, and project plans.

Collaborate with cross-functional teams (compiler, performance tooling, runtime, deployment solutions) to deliver reusable, reliable, high-performance libraries into production.

Maintain high technology standards, methodologies, processes, and guidelines for GPU kernel development and performance engineering through code review.

Manage relationships with internal customers to ensure our kernels and libraries meet real-world needs.

Your Skills & Abilities (Required Qualifications)

2+ years of relevant industry experience, or equivalent

BS, MS, or PhD in CS or a related technical field

Excellent GPU programming skills in CUDA, with a thorough understanding of parallel programming patterns and GPU architecture.

Hands‑on experience benchmarking, profiling, debugging and optimizing accelerator libraries and kernels to extract optimal performance using the NSight suite of tools or similar.

Strong background in software architecture, library design, and design patterns.

Strong C++ programming skills and comfort working in large codebases.

Solid background in system performance, high performance computing and/or architecture‑aware optimizations.

Strong communication skills and the ability to work collaboratively within a team

Excellent analytical and problem‑solving skills

What Will Give You A Competitive Edge (Preferred Qualifications)

2+ years of relevant industry experience or equivalent experience

Experience with tensor core programming, CUTLASS, and/or CuTe

Experience with ML model architectures, in particular transformer‑based ones

Experience with low latency or real time systems

Experience with lower levels of an accelerator software stack (i.e. drivers,…