Inference Architecture Interns
Job in San Jose, Santa Clara County, California, 95199, USA
Listed on 2026-06-17
Listing for: Etched
Apprenticeship/Internship position
Job specializations:
- Software Development
  Python, AI Engineer (Applied/Software), Computer Software / Middleware, Backend Developer
Inference Intern | Etched
The Tone:
This is an internship at Etched, located in San Jose, CA. Etched is building the world’s first AI inference system purpose‑built for transformers, aiming to deliver significantly higher performance and lower costs compared to existing solutions. This role is crucial for developing and optimizing compute architectures that achieve exceptional performance and efficiency for transformer workloads. Interns will contribute to the design of next‑generation AI accelerators, working on cutting‑edge architectural problems and performance modeling.
The TL;DR
- Role: Internship
- Type: Temporary
- Location: In‑person, San Jose, CA
- Mission: Develop and optimize compute architectures that deliver exceptional performance and efficiency for transformer workloads.
- Tech Stack: Python, C++, Linux internals, accelerator architectures (GPUs, TPUs), compilers, high‑speed interconnects (NVLink, InfiniBand), vLLM, SGLang, Rust, PyTorch, JAX
- Model Porting: Support porting state‑of‑the‑art models to the architecture and help build programming abstractions and high‑performance software components for rapid iteration.
- Runtime Development: Assist in building, enhancing, and scaling Sohu’s runtime, including multi‑node inference, intra‑node execution, state management, and robust error handling.
- Communication Optimization: Contribute to optimizing routing and communication layers using Sohu’s collectives.
- Performance Analysis: Utilize performance profiling and debugging tools to identify bottlenecks and correctness issues.
- Architecture Co‑design: Develop a deep understanding of Sohu to co‑design both hardware instructions and model architecture operations to maximize model performance.
- Background: Student progressing towards a Bachelor’s, Master’s, or PhD degree in computer science, computer engineering, applied mathematics, or a related field.
- Experience: Understanding of performance‑sensitive or complex distributed software systems, such as Linux internals, accelerator architectures (e.g., GPUs, TPUs), compilers, or high‑speed interconnects (e.g., NVLink, InfiniBand), coupled with experience porting applications to non‑standard accelerator hardware or platforms. Deep knowledge of transformer model architectures and/or inference serving stacks like vLLM or SGLang is also required.
- Skills: Proficiency in Python and C++.
- Bonus: Proficiency in Rust, experience with low‑latency and high‑performance applications using kernel‑level and user‑space networking stacks, a deep understanding of distributed systems concepts, solid grasp of Transformer architectures (especially Mixture‑of‑Experts), experience building applications with extensive SIMD optimizations, familiarity with PyTorch or JAX, or participation in math competitions.