Principal Engineer – Distributed AI Systems Architecture; Heterogeneous Compute
Listed on 2026-05-31
Engineering
AI Engineer, Systems Engineer, Software Engineer
Job Overview
We are seeking a Principal Engineer to define and architect the next generation of distributed AI systems across heterogeneous compute platforms, including CPUs, GPUs, IPUs, and emerging dataflow accelerators. This role focuses on dynamically executing and optimizing large‑scale AI computation graphs across diverse hardware while managing state, locality, and performance at system scale.
Key Responsibilities
- Dynamic Execution of Distributed Computation Graphs
- Define a runtime model for executing AI workloads as distributed computation graphs across heterogeneous resources.
- Design abstractions for graph representation, dependencies, and execution semantics.
- Enable dynamic scheduling and execution across CPUs, GPUs, and specialized accelerators.
- Stateful Scheduling and Memory‑Centric Architecture
- Architect systems where state (e.g., KV cache) is a first‑class concern in scheduling and execution.
- Define models for data locality, memory hierarchy, and state ownership for distributed inference.
- Optimize for minimal data movement and efficient access to distributed state.
- Graph Introspection and Automated Partitioning
- Develop mechanisms to analyze AI computation graphs and classify stages by compute intensity, memory bandwidth, communication cost, and latency sensitivity.
- Drive automated or semi‑automated partitioning of workloads across heterogeneous compute resources.
- Integration of Specialized Accelerators
- Architect frameworks that treat specialized accelerators as first‑class execution targets.
- Define execution boundaries, data exchange models, and integration strategies across device classes.
- Enable interoperability across diverse compute paradigms without sacrificing performance.
- MoE‑Aware Execution and Adaptive Placement
- Design runtime strategies for Mixture‑of‑Experts models including expert placement, routing locality, and load balancing vs. data movement trade‑offs.
- Enhance existing frameworks for MoE and optimize communication paths with IPUs and Intel accelerators.
- Enable adaptive execution based on real‑time system signals (latency, utilization, skew).
- Adaptive Runtime and Feedback‑Driven Optimization
- Define observability and telemetry models for distributed AI execution.
- Build feedback loops that continuously optimize placement, scheduling, and resource utilization.
- Drive system‑level performance across latency, throughput, and efficiency metrics.
Minimum Qualifications:
- Bachelor's or equivalent degree in Computer Science, Software Engineering, or related field.
- 12+ years of experience with a Bachelor's degree.
- Proven expertise in defining and implementing software architectures for AI frameworks, protocols, and algorithms.
- Deep experience in systems architecture, high‑performance computing, or distributed systems.
- Strong background in parallel or data‑parallel computation models.
- Experience with heterogeneous compute environments (CPU, GPU, DSP, or accelerators).
- Proven ability to design end‑to‑end systems from abstraction through implementation.
- Strong understanding of performance trade‑offs across compute, memory, and interconnect.
Preferred Qualifications:
- 8+ years of experience with a Master's degree, or 6+ years with a PhD.
- Experience with AI/ML systems, inference infrastructure, or large‑scale model serving.
- Familiarity with stream processing, dataflow models, or graph execution systems.
- Knowledge of modern AI frameworks or runtimes.
- Experience building developer‑facing SDKs or programming models.
- Background in performance optimization and benchmarking.
Job Type: Experienced Hire
Shift: Shift 1 (United States of America)
Primary Location: US, California, Santa Clara
Additional Locations: US, Oregon, Hillsboro; US, Texas, Austin
All qualified applicants will receive consideration for employment without regard to race, color, religion, religious creed, sex, national origin, ancestry, age, physical or mental disability, medical condition, genetic information, military and veteran status, marital status, pregnancy, gender, gender expression, gender identity, sexual orientation, or any other characteristic protected by local law, regulation, or ordinance.
Position of Trust
This role is a Position of Trust. Candidates must consent to and pass an extended background investigation, which includes education, SEC sanctions, and additional criminal and civil checks (subject to country law). For internal applicants, the investigation may or may not be completed before starting the position.
Benefits and Compensation
We offer a total compensation package that ranks among the best in the industry, including competitive pay, stock bonuses, and benefits covering health, retirement, and vacation. Annual salary range for this role in the US: $ – $.
Work Model
This role requires on‑site presence.