Researcher, Post-Training
Listed on 2026-06-07
Research/Development
Data Scientist
ABOUT THE COMPANY
We're building autonomous research agents for recursive self-improvement (multi-agent systems that propose, run, and analyze machine learning experiments). We're a small team based in San Francisco, on-site.
ABOUT THE ROLE
You'll lead our work on model post-training: supervised fine-tuning, preference data, reinforcement learning from human and AI feedback, reward modeling, and the evaluation suites that tell us what's actually working. You'll own a research area that meaningfully shapes our model behavior and capability.
This is a hands‑on senior research role. You'll set direction, run experiments, and ship into production. You'll partner with the data, infrastructure, and engineering teams to make the post‑training pipeline reliable and fast: improvements there compound into every model we ship.
WHAT YOU'LL DO
- Lead post‑training research: SFT, RLHF/RLAIF, RLVR, DPO and successor methods, reward modeling, preference data design
- Design and curate the data that goes into post‑training (from sourcing, to filtering, to quality assessment)
- Build and maintain the evaluation suites that measure what matters; resist Goodharting your own benchmarks
- Run rigorous experiments (controls, ablations, statistical significance) and write up internal findings clearly
- Scale data pipelines and partner with the infrastructure team to scale training
- Identify and characterize failure modes (reward hacking, distribution drift, eval saturation) and design experiments to address them
- Stay current on the post‑training literature; bring useful methods in, ignore the noise
WHAT WE'RE LOOKING FOR
- Strong track record of post‑training research (SFT, RL, reward modeling) at a frontier‑model lab or equivalent
- 5+ years of hands‑on ML research experience
- Comfort with large‑scale data curation and preference‑data pipelines
- Experience designing evaluation suites for capabilities that aren't easily benchmarked
- Fluent in PyTorch or equivalent; comfortable at the scale of distributed training
- Strong statistical instincts: you'd notice a flawed comparison before someone else points it out
- Strong written communication
NICE TO HAVE
- PhD in ML, statistics, CS, or an adjacent field
- Published research at NeurIPS, ICML, ICLR, COLM, RLC, or comparable venues
- Experience with reward hacking detection, scaling reward models, or RLHF infrastructure
- Synthetic data generation experience
- Background in RL math (policy gradients, importance sampling, off‑policy methods)
- Open‑source contributions to post‑training infrastructure
THIS ROLE MAY NOT BE A FIT IF
- You are primarily interested in pretraining (that's a different role)
- You would rather invent novel methods in isolation than ship them into a model that real users run
- You prefer benchmarks that are stable to evaluation work where the right answer isn't yet defined