Reinforcement Learning Hub

Roadmaps, systems, and core questions for RL research. This space is dedicated to the agents, reward models, and circuit-level effects of RL training.

Future resources and roadmap for RL

  • Redo Sutton and Barto
  • Spinning Up in Deep RL — OpenAI
  • Deep RL course by Hugging Face
  • DQN (Mnih et al. 2015), A3C (Mnih et al. 2016), PPO (Schulman et al. 2017), SAC (Haarnoja et al. 2018), TRPO (Schulman et al. 2015)
  • InstructGPT (Ouyang et al. 2022), Constitutional AI (Revisited), GRPO (DeepSeek 2024, replaces PPO in R1), DeepSeek R1 (2025), Let Me Think (various 2024-25)
  • Reinforcement Learning: Theory and Algorithms by Agarwal, Jiang, Kakade, and Sun
  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Rafailov, Sharma, Mitchell, Ermon, Manning, Finn (Stanford, 2023). Lilian Weng's blog post "Alignment: RLHF and Beyond" covers DPO alongside PPO and RLHF.

Recommended timeline

Phase 1 (1-2 months):

  • Spinning Up implementation of PPO from scratch
  • Train on simple Gym environments (CartPole → LunarLander)
  • Understand the code deeply, not just run it
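
As a starting point for the from-scratch PPO work above, here is a minimal sketch of the clipped surrogate loss from Schulman et al. 2017, written in plain NumPy so the arithmetic is easy to inspect. The function name `ppo_clip_loss` and the default `clip_eps=0.2` are illustrative assumptions, not taken from Spinning Up; a real implementation would compute this on autograd tensors and add value and entropy terms.

```python
import numpy as np


def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Negated PPO clipped surrogate objective (lower is better).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and the data-collecting policy. All inputs share one shape.
    The name and defaults are hypothetical, chosen for illustration.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = np.exp(logp_new - logp_old)

    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Taking the elementwise minimum makes the objective pessimistic:
    # the policy gains nothing from pushing the ratio outside the clip range.
    return -np.mean(np.minimum(unclipped, clipped))
```

The useful thing to check by hand is the asymmetry: with a positive advantage the gain is capped once the ratio exceeds 1 + clip_eps, while with a negative advantage a ratio above 1 + clip_eps is still penalized in full.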

Phase 2 (2-3 months):

  • Read GRPO paper
  • Implement GRPO on a small LLM (Qwen 1.5B or Gemma 2B)
  • Compare with your CAI experiments — same model, different training signal
  • Post on Scorpion Labs
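
The piece of GRPO that differs most from PPO is how advantages are computed: instead of a learned value function, each sampled completion is scored relative to the other completions for the same prompt. The sketch below shows only that normalization step, assuming one scalar reward per completion; the function name and the `eps` stabilizer are illustrative assumptions. The full algorithm as described in the GRPO paper also uses a clipped ratio and a KL penalty against a reference policy, which are not shown here.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within each prompt's group of sampled completions.

    rewards: array of shape (num_prompts, group_size), one row per prompt.
    Returns an array of the same shape. The name and eps are hypothetical.
    """
    rewards = np.asarray(rewards, dtype=float)

    # Per-prompt baseline and scale, taken over the group dimension.
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)

    # eps keeps the division finite when every completion in a group
    # received the same reward (std == 0), giving zero advantage there.
    return (rewards - mean) / (std + eps)


# Hypothetical example: 2 prompts, 4 sampled completions each.
example = group_relative_advantages([[1.0, 0.0, 0.0, 1.0],
                                     [0.0, 0.0, 0.0, 0.0]])
```

A group where all completions score the same contributes no learning signal, which is worth keeping in mind when choosing the group size and the reward function for a small model.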

Phase 3 (August onwards, combines with Evo2):

  • Process reward models on reasoning tasks
  • RL applied to genomic sequence models
  • Mechanistic interpretability of RL-trained models (how does the policy circuit differ from the supervised-learning (SL) circuit?)

That last question — what does RL training do to the internal circuits a model uses? — is almost completely unstudied. BizzaroWorld found the factual recall circuit in a pretrained model. What happens to that circuit after RLHF? After CAI? That's a paper nobody has written yet and it sits exactly at the intersection of your four pillars.

Non-stationary environments and large state spaces

Self-driving systems are drawn on heavily here because they face both problems, non-stationary environments and large state spaces, and perform well in practice. What can we learn from top-tier systems that use RL to train agents under these conditions?

  1. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero) (Schrittwieser et al., DeepMind, 2019)
  2. Learning and Planning in Complex Action Spaces (Sampled MuZero) (Hubert et al., DeepMind, 2021)
  3. Learning to Drive in a Day (Kendall et al., Wayve, 2018)
  4. Model-Based Imitation Learning for Urban Driving (MILE) (Hu et al., Wayve, 2022)
  5. Mastering Diverse Domains through World Models (DreamerV3) (Hafner et al., 2023)
  6. Andrej Karpathy - CVPR 2021 Keynote on Tesla Autopilot
  7. Andrej Karpathy - Tesla AI Day 2021 / 2022
  8. World Models (Ha and Schmidhuber, 2018)

Waymo: HD maps + classical planning + learned components
Wayve: end-to-end learned world model, minimal priors
Tesla: massive supervised learning on human demonstrations + neural planner

Three different philosophies, all working at different levels. The comparison between them is itself a research paper waiting to be written from a mech interp perspective — what did each system learn, and how does the internal representation differ based on the training approach?
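
One coarse way to start on the "how does the internal representation differ" question is to compare activations from two models on the same inputs with linear CKA (centered kernel alignment). This is a swapped-in technique, not something these notes prescribe, and it measures representational similarity rather than identifying circuits, so treat it as a first-pass screen before any circuit-level work. The function name and the activation shapes are assumptions for illustration.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices from the same inputs.

    x: (n_samples, d1), y: (n_samples, d2). Returns a similarity in [0, 1].
    Rows must correspond to the same input examples in both matrices.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)
```

In use, `x` and `y` would be activations from the same layer of two differently trained models (hypothetically, a pretrained checkpoint and its RL-tuned version), collected on an identical batch of inputs. The score is invariant to rotations and uniform rescaling of either representation, so a low value points to a genuine change in representational geometry, not just a change of basis.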