Senior NPU Kernel/Operator Engineer
Job in San Jose, Santa Clara County, California, 95199, USA
Listed on 2026-06-14
Listing for: Viridan Group
Full Time position
Job specializations:
- Engineering: Systems Engineer, Hardware Engineer
Job Description & How to Apply Below
Viridan Group has partnered with a cutting-edge autonomous driving chip design company that is looking for a Senior NPU Kernel/Operator Engineer to lead the design and optimization of high-performance kernels for a custom AI accelerator/NPU. The role focuses on general-purpose deep learning operators, fused kernels, and hardware-aware performance optimization across CNNs, transformers, and other neural network workloads. The ideal candidate has strong experience in performance engineering on GPU, NPU, DSP, CPU SIMD, compiler backend, embedded accelerator, or HPC systems.
Responsibilities
- Design and optimize high-performance NPU kernels for a broad range of neural network workloads.
- Own critical operators such as attention-style kernels, normalization, reduction, layout conversion, gather/scatter, quant/dequant, and fused operators.
- Develop tiling, blocking, vectorization, and memory scheduling strategies.
- Optimize data movement across matrix engine, vector engine, SRAM, DMA, NoC, cache, and DRAM.
- Analyze bottlenecks in compute utilization, memory bandwidth, synchronization, DMA overlap, bank conflicts, and instruction overhead.
- Build first-principles performance models for key operators.
- Drive kernels toward hardware roofline limits.
- Collaborate with hardware, compiler, runtime, and model teams on ISA features, tensor layouts, memory access patterns, and operator APIs.
- Debug complex correctness, precision, and performance issues on simulator or silicon.
- Mentor junior engineers and establish kernel optimization best practices.
Qualifications
- BS/MS/PhD in CS, EE, Computer Engineering, or related field.
- 5+ years of experience in performance optimization, accelerator programming, GPU/NPU/DSP development, compiler backend, embedded systems, or HPC.
Required Skills
- Deep understanding of memory hierarchy, tiling, parallelism, vectorization, synchronization, and bandwidth analysis.
- Experience optimizing performance-critical kernels or numerical computation.
- Ability to reason from algorithm requirements to hardware execution and performance bottlenecks.
Preferred Skills
- Experience with CUDA, Triton, CUTLASS, OpenCL, TVM, MLIR, Halide, SIMD intrinsics, DSP SDKs, or custom accelerator SDKs.
- Experience optimizing operators such as convolution, GEMM, attention, softmax, normalization, reduction, image processing, or fused compute/memory kernels.
- Familiarity with custom AI accelerator architecture, matrix engines, vector engines, systolic arrays, DMA, SRAM, NoC, or DRAM systems.
- Experience with mixed precision and quantization: FP32, FP16, BF16, FP8, INT8, INT4.
- Experience with simulator/emulator/FPGA/silicon bring-up is a plus.
Position Requirements
10+ years of work experience