Post-Training - Engineer / Algorithm Researcher Job, Menlo Park area, California, USA, IT/Tech

Position: Post-Training — Engineer / Algorithm Researcher [33248]

Responsibilities

Post-training algorithm R&D. Own the full post-training pipeline for the coding agent—supervised fine-tuning (SFT), reward modeling, and reinforcement learning (RLHF/DPO/GRPO/PPO, etc.)—continuously improving code generation, debugging, and multi-step reasoning on real software-engineering tasks.
Verifiable rewards & agentic RL. For coding scenarios, design reward mechanisms based on verifiable signals (unit tests, compile/execution results, static checks, etc.), i.e., RL with verifiable rewards (RLVR); build a multi-turn agentic RL training paradigm with tool-call and execution-feedback loops to improve success rate and stability on long-horizon tasks.
Evaluation-model training. Develop evaluation/judge models for coding tasks (LLM-as-a-Judge, generative reward models, critic/verifier models, etc.); use post-training to give them highly consistent judgment of code correctness, executability, and quality; continuously improve alignment with human annotation and verifiable signals to reduce evaluation bias and noise.
Data & reward-signal engineering. Lead the construction and governance of post-training data—preference-data collection, synthetic-data generation, difficulty grading, and quality filtering; identify and mitigate reward hacking and distribution drift to keep training and evaluation signals reliable.
Training–evaluation loop. Partner with the evaluation team to build an end-to-end evaluation system for coding agents (SWE-bench-style benchmarks, in-house task sets); feed results back into post-training iteration to create a fast experiment–verify–converge cadence.
Training at scale. Work closely with the infra team to land RL training efficiently on large clusters; optimize the coordination of rollout sampling, inference engines (vLLM/SGLang), and the training framework to raise overall throughput and sample efficiency.

Qualifications

Education. Bachelor's degree or above in CS, AI, Mathematics, Statistics, or a related field; Master's/PhD preferred.
Post-training experience. Deep understanding of the LLM post-training stack; complete hands-on experience in at least one of SFT, RLHF, DPO/GRPO/PPO, or reward modeling; able to independently run the full experiment loop from data to training to evaluation.
Evaluation-model experience. Understanding of reward-model / judge-model training and evaluation; familiarity with LLM-as-a-Judge, pairwise/pointwise scoring, and verifier paradigms; experience with evaluation consistency, calibration, and bias analysis a plus.
RL foundations. Solid grasp of RL fundamentals (policy gradients, value functions, advantage estimation, etc.); experience with stability, sample efficiency, and hyperparameter tuning of RL training in the LLM setting.
Engineering ability. Proficient in Python with a solid foundation in data structures and algorithms; skilled with PyTorch, with hands-on usage of or custom development on mainstream post-training/RL frameworks (TRL, veRL, OpenRLHF, DeepSpeed-Chat, etc.).
Coding-domain understanding. Understanding of code generation and software-engineering task characteristics; able to build effective training and evaluation signals around verifiable rewards, sandboxed execution, and test-case design.
Research & debugging. Able to read and reproduce frontier papers; strong at analyzing and diagnosing training-curve anomalies, reward collapse, model degradation, and evaluation drift.
