Compiler Engineer - PyTorch + Kernel DSLPLATE Job, San Jose area, California, USA, Software Development

Position: Staff Compiler Engineer - PyTorch + Kernel DSLPLATE

To provide the best candidate experience amidst our high application volumes, each candidate is limited to 10 applications across all open jobs within a 6-month period.

Advancing the World’s Technology Together

Our technology solutions power the tools you use every day, including smartphones, electric vehicles, hyperscale data centers, IoT devices, and so much more. Here, you’ll have an opportunity to be part of a global leader whose innovative designs are pushing the boundaries of what’s possible and powering the future.

We believe innovation and growth are driven by an inclusive culture and a diverse workforce. We’re dedicated to empowering people to be their true selves. Together, we’re building a better tomorrow for our employees, customers, partners, and communities.

The AGI (Artificial General Intelligence) Computing Lab is dedicated to solving the complex system-level challenges posed by the growing demands of future AI/ML workloads. Our team is committed to designing and developing scalable platforms that can effectively handle the computational and memory requirements of these workloads while minimizing energy consumption and maximizing performance. To achieve this goal, we collaborate closely with both hardware and software engineers to identify and address the unique challenges posed by AI/ML workloads and to explore new computing abstractions that can provide a better balance between the hardware and software components of our systems.

Additionally, we continuously conduct research and development in emerging technologies and trends across memory, computing, interconnect, and AI/ML, ensuring that our platforms are always equipped to handle the most demanding workloads of the future. By working together as a dedicated and passionate team, we aim to revolutionize the way AI/ML applications are deployed and executed, ultimately contributing to the advancement of AGI in an affordable and sustainable manner.

Join us in our passion to shape the future of computing!

Location: Daily onsite presence at our San Jose, CA office / U.S. headquarters in alignment with our Flexible Work policy.

What You’ll Do

Adapting torch.compile to our backend: lowering Inductor's IR to our hardware, defining what gets fused, what gets specialized, and where the compiler should yield to hand‑written kernels.
Building or extending kernel DSLs for our hardware: taking a tile‑based programming model (Triton‑style), a higher‑level expression (Helion‑style), or a custom DSL we design, and lowering it to our ISA, our memory hierarchy, and our collective primitives.
Designing placement and scheduling passes: given a graph and our distributed memory model, deciding where tensors live, when to migrate them, and how to overlap compute with data movement.
Implementing parallelism‑aware lowering: making tensor, pipeline, expert, and sequence parallelism first‑class in the compiler IR rather than bolted on at the framework layer.
Fusion, tiling, and memory planning: the classical compiler problems, reframed for a non‑uniform memory hierarchy where the right tile size and the right placement are coupled decisions.
Upstream contributions: where we use open‑source DSLs, we want our work to land upstream rather than live in a private fork. You'll engage with upstream review processes for PyTorch, Triton, Helion, and adjacent projects.

What You Bring

Bachelor’s with 10+ years, Master’s with 8+ years, or PhD with 5+ years of industry experience.
3‑5+ years of industry experience in at least one of: Triton, Helion, MLIR, XLA, TVM, Inductor, IREE, CUTLASS, or a proprietary equivalent (more experienced candidates will also be considered at relevant levels).
Experience designing a kernel DSL or its IR from scratch, or making non‑trivial language‑level changes to an existing one.
Experience with MLIR — writing dialects, passes, or backend integration.
Experience building PyTorch backends for non‑CUDA accelerators (XPU, ROCm, MPS, TPU, custom).
Experience with kernel autotuning, performance modeling, or cost‑based compilation.
Background in HPC, distributed systems, or NUMA‑aware programming — anything that built intuition for non‑flat memory.
Op…