Post-Training — Engineer/Algorithm Researcher
Job in Menlo Park, San Mateo County, California, 94029, USA
Listed on 2026-06-24
Listing for: Stealth Startup
Apprenticeship/Internship position
Job specializations:
- IT/Tech: Machine Learning/ML Engineer, AI Evaluation, Data Scientist
Job Description & How to Apply Below
- Post-training algorithm R&D. Own the full post-training pipeline for the coding agent—supervised fine-tuning (SFT), reward modeling, and reinforcement learning (RLHF/DPO/GRPO/PPO, etc.)—continuously improving code generation, debugging, and multi-step reasoning on real software-engineering tasks.
- Verifiable rewards & agentic RL. Design reward mechanisms based on verifiable signals (unit tests, compile/execution results, static checks, etc.) for coding scenarios (RLVR); build a multi-turn agentic RL training paradigm with tool-call and execution-feedback loops, improving success rate and stability on long-horizon tasks.
- Evaluation-model training. Develop evaluation/judge models for coding tasks (LLM-as-a-Judge, generative reward models, critic/verifier models, etc.); use post-training to give them highly consistent judgment of code correctness, executability, and quality; continuously improve alignment with human annotation and verifiable signals to reduce evaluation bias and noise.
- Data & reward-signal engineering. Lead the construction and governance of post-training data—preference-data collection, synthetic-data generation, difficulty grading, and quality filtering; identify and mitigate reward hacking and distribution drift to keep training and evaluation signals reliable.
- Training–evaluation loop. Partner with the evaluation team to build an end-to-end evaluation system for coding agents (SWE-bench-style benchmarks, in-house task sets); feed results back into post-training iteration to create a fast experiment–verify–converge cadence.
- Training at scale. Work closely with the infra team to land RL training efficiently on large clusters; optimize the coordination of rollout sampling, inference engines (vLLM/SGLang), and the training framework to raise overall throughput and sample efficiency.
- Education. Bachelor's degree or above in CS, AI, Mathematics, Statistics, or a related field; Master's/PhD preferred.
- Post-training experience. Deep understanding of the LLM post-training stack; end-to-end hands-on experience in at least one of SFT, RLHF, DPO/GRPO/PPO, or reward modeling; able to independently run the full experiment loop from data to training to evaluation.
- Evaluation-model experience. Understanding of reward-model / judge-model training and evaluation; familiarity with LLM-as-a-Judge, pairwise/pointwise scoring, and verifier paradigms; experience with evaluation consistency, calibration, and bias analysis a plus.
- RL foundations. Solid grasp of RL fundamentals (policy gradients, value functions, advantage estimation, etc.); experience with stability, sample efficiency, and hyperparameter tuning of RL training in the LLM setting.
- Engineering ability. Proficient in Python with a solid foundation in data structures and algorithms; skilled with PyTorch, with hands-on experience using or extending mainstream post-training/RL frameworks (TRL, veRL, OpenRLHF, DeepSpeed-Chat, etc.).
- Coding-domain understanding. Understanding of code generation and software-engineering task characteristics; able to build effective training and evaluation signals around verifiable rewards, sandboxed execution, and test-case design.
- Research & debugging. Able to read and reproduce frontier papers; strong at analyzing and diagnosing training-curve anomalies, reward collapse, model degradation, and evaluation drift.