Research Engineer - LLM/VLM Inference Optimization; Seed Infra
Job in
Seattle, King County, Washington, 98113, USA
Listed on 2026-06-02
Listing for:
ByteDance
Full-time position
Job specializations:
- IT/Tech
  AI Engineer, Machine Learning/ML Engineer, Data Scientist, Data Engineer
Job Description & How to Apply Below
About the Team
The Seed Infrastructures team oversees the distributed training, reinforcement learning framework, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.
Responsibilities:
1. Design, develop, and optimize high-performance inference systems for large-scale LLMs and VLMs, covering inference engines, serving frameworks, and end-to-end deployment pipelines.
2. Build state-of-the-art model inference engines through advanced performance optimization techniques such as compiler-level optimizations, parallel computing, graph fusion, efficient CUDA kernel development, low-precision computation, streaming inference, speculative decoding, and high-concurrency request optimization.
3. Collaborate closely with other research teams to identify performance bottlenecks, conduct in-depth performance analysis, and optimize large models; contribute to the development of model tool chains and the broader technical ecosystem.
Minimum Qualifications:
1. Bachelor's degree or above in Computer Science, Electrical Engineering, Software Engineering, or a related field.
2. Strong proficiency in C/C++ and Python; solid foundations in algorithms, data structures, and systems programming; familiarity with containerization and server-side debugging.
3. Hands-on experience with at least one mainstream machine learning framework (e.g., PyTorch, TensorFlow).
4. Experience deploying or optimizing LLM/VLM inference at production scale, with demonstrated impact on latency, throughput, or serving cost.
5. Familiarity with GPU architecture and experience optimizing compute-intensive operators (e.g., Flash Attention, GEMM, GEMV, Conv2D).
Preferred Qualifications:
1. Experience with large-scale LLM serving infrastructure or equivalent production LLM deployment experience.
2. Experience in GPU programming (CUDA/OpenCL) and familiarity with frameworks such as TensorRT, Triton, or CUTLASS.
3. Experience in performance modeling, profiling, and optimization, or strong knowledge of CPU/GPU architectures.
4. Familiarity with model/data parallelism frameworks for distributed inference.