Senior NPU Kernel/Operator Engineer Job, San Jose area, California, USA, Engineering

Viridan Group has partnered with a cutting-edge autonomous driving chip design company that is looking for a Senior NPU Kernel/Operator Engineer to lead the design and optimization of high-performance kernels for a custom AI accelerator/NPU. This role focuses on general-purpose deep learning operators, fused kernels, and hardware-aware performance optimization across CNNs, transformers, and other neural network workloads. The ideal candidate has strong experience in performance engineering on GPU, NPU, DSP, CPU SIMD, compiler backend, embedded accelerator, or HPC systems.

Responsibilities

Design and optimize high-performance NPU kernels for a broad range of neural network workloads.
Own critical operators such as attention-style kernels, normalization, reduction, layout conversion, gather/scatter, quant/dequant, and fused operators.
Develop tiling, blocking, vectorization, and memory scheduling strategies.
Optimize data movement across matrix engine, vector engine, SRAM, DMA, NoC, cache, and DRAM.
Analyze bottlenecks in compute utilization, memory bandwidth, synchronization, DMA overlap, bank conflicts, and instruction overhead.
Build first-principles performance models for key operators.
Drive kernels toward hardware roofline limits.
Collaborate with hardware, compiler, runtime, and model teams on ISA features, tensor layouts, memory access patterns, and operator APIs.
Debug complex correctness, precision, and performance issues on simulator or silicon.
Mentor junior engineers and establish kernel optimization best practices.

Qualifications

BS/MS/PhD in CS, EE, Computer Engineering, or related field.
5+ years of experience in performance optimization, accelerator programming, GPU/NPU/DSP development, compiler backend, embedded systems, or HPC.

Required Skills

Deep understanding of memory hierarchy, tiling, parallelism, vectorization, synchronization, and bandwidth analysis.
Experience optimizing performance-critical kernels or numerical computation.
Ability to reason from algorithm requirements to hardware execution and performance bottlenecks.

Preferred Skills

Experience with CUDA, Triton, CUTLASS, OpenCL, TVM, MLIR, Halide, SIMD intrinsics, DSP SDKs, or custom accelerator SDKs.
Experience optimizing operators such as convolution, GEMM, attention, softmax, normalization, reduction, image processing, or fused compute/memory kernels.
Familiarity with custom AI accelerator architecture, matrix engines, vector engines, systolic arrays, DMA, SRAM, NoC, or DRAM systems.
Experience with mixed precision and quantization: FP32, FP16, BF16, FP8, INT8, INT4.
Experience with simulator/emulator/FPGA/silicon bring-up is a plus.

