Research Engineer - LLM/VLM Inference Optimization (Seed Infra)

Job in Seattle, King County, Washington, 98113, USA
Listing for: ByteDance
Full Time position
Listed on 2026-06-02
Job specializations:
  • IT/Tech
    AI Engineer, Machine Learning / ML Engineer, Data Scientist, Data Engineer
Job Description & How to Apply Below
Position: Research Engineer - LLM/VLM Inference Optimization (Seed Infra)
About the Team:

The Seed Infrastructures team oversees the distributed training, reinforcement learning framework, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.

Responsibilities:

1. Design, develop, and optimize high-performance inference systems for large-scale LLMs and VLMs, covering inference engines, serving frameworks, and end-to-end deployment pipelines.
2. Build state-of-the-art model inference engines through advanced performance optimization techniques such as compiler-level optimizations, parallel computing, graph fusion, efficient CUDA kernel development, low-precision computation, streaming inference, speculative decoding, and high-concurrency request optimization.
3. Collaborate closely with other research teams to identify performance bottlenecks, conduct in-depth performance analysis, and optimize large models; contribute to the development of model tool chains and the broader technical ecosystem.

Minimum Qualifications:

1. Bachelor's degree or above in Computer Science, Electrical Engineering, Software Engineering, or a related field.
2. Strong proficiency in C/C++ and Python; solid foundations in algorithms, data structures, and systems programming; familiarity with containerization and server-side debugging.
3. Hands-on experience with at least one mainstream machine learning framework (e.g., PyTorch, TensorFlow).
4. Experience deploying or optimizing LLM/VLM inference at production scale, with demonstrated impact on latency, throughput, or serving cost.
5. Familiarity with GPU architecture and experience optimizing compute-intensive operators (e.g., FlashAttention, GEMM, GEMV, Conv2D).

Preferred Qualifications:

1. Experience with large-scale LLM serving infrastructure or equivalent production LLM deployment experience.
2. Experience in GPU programming (CUDA/OpenCL) and familiarity with frameworks such as TensorRT, Triton, or CUTLASS.
3. Experience in performance modeling, profiling, and optimization, or strong knowledge of CPU/GPU architectures.
4. Familiarity with model/data parallelism frameworks for distributed inference.