How can a small model learn to solve tasks it currently fails at, without rote imitation or reliance on a correct rollout? A team of researchers from Google Cloud AI Research and UCLA has released a training framework, ‘Supervised Reinforcement Learning’ (SRL), that enables 7B-scale models to actually learn from very hard math and agent trajectories that neither standard supervised fine-tuning (SFT) nor outcome-based reinforcement learning (RL) can learn from. Small open-source models such as Qwen2.5-7B-Instruct fail on the hardest problems in s1K-1.1, even when the teacher trace is good. If we apply supervised fine-tuning…
