hlfshell

#deepseek

DeepSeek + Inference-Time Scaling and Generalist Reward Modeling

DeepSeek released another cool paper expanding on reinforcement learning for LLM alignment. Building off of their prior work (which I talk about here), they introduce two new methods.

The first - Rejective Fine-Tuning (RFT) - has a pre-trained model produce N responses. The collected responses are then combined into a prompt instructing the model to produce principles for evaluating the responses, a critique of each response, and a reward score for each based on the generated principles. The process utilizes a human judge to critique the model's critiques, teaching it to eventually produce these evaluations without human feedback.
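To make that concrete, here's a minimal sketch of how I picture the data-collection step - my reading, not the paper's exact pipeline. The `generate` callable, the prompt wording, and the 1-10 score scale are all my assumptions.

```python
def collect_evaluation_example(generate, question, n=4):
    # 1. Sample N candidate responses from the pre-trained model.
    responses = [generate(f"Question: {question}\nAnswer:") for _ in range(n)]

    # 2. Ask the model to produce principles, per-response critiques,
    #    and a score for each response based on those principles.
    numbered = "\n".join(f"[{i + 1}] {r}" for i, r in enumerate(responses))
    eval_prompt = (
        f"Question: {question}\n"
        f"Candidate responses:\n{numbered}\n\n"
        "First, write principles for judging these responses.\n"
        "Then critique each response against those principles.\n"
        "Finally, give each response a score from 1-10."
    )
    evaluation = generate(eval_prompt)

    # 3. In this cold-start phase, a human judge reviews the evaluation;
    #    only examples that pass review are kept for fine-tuning.
    return {"question": question, "responses": responses, "evaluation": evaluation}
```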

The goal of this is to move the reward function from a simple scalar value to a language-derived scalar value. If the model is being trained on domains that don't translate well to automatic correctness checks, this improves performance. This is why reasoning models have, to this point, generally focused on mathematics, coding, and similar domains - there are easy and clear ground-truth results to evaluate against. In more subjective domains (ethics, open-ended questions, etc.), however, a scalar value is not easily derived without human intervention.
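The shift is subtle but important: the scalar still exists, it's just extracted from generated text rather than emitted by a reward head. A toy illustration of that extraction step, assuming a "Score: N/10" output format (my assumption, not the paper's schema):

```python
import re

def score_from_critique(critique_text: str) -> float | None:
    # Pull the scalar reward out of a natural-language critique.
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)\s*/\s*10", critique_text)
    return float(match.group(1)) if match else None

print(score_from_critique(
    "The answer is well-reasoned but misses an edge case. Score: 7/10"
))  # 7.0
```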

Once the model is proficient at generating these evaluations, the second method - Self-Principled Critique Tuning (SPCT) - is introduced. This method uses reinforcement learning to adaptively improve the model's ability to generate critiques and principles without human intervention. A sort of feedback loop now begins, wherein the model generates its own evaluation criteria, critiques, and responses, assigns scores, and then receives rewards based on how well it ranks the responses against known human preferences. GRPO is still used for policy optimization, but the model is learning without direct human ratings or real-time human raters.
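Here's a rough sketch of how I imagine the reward signal for the evaluator itself could look: the model's scores get a positive reward when they agree with the known human preference, and the sampled evaluations are then compared against each other in GRPO fashion. The exact reward rule and the normalization below are simplified assumptions.

```python
def evaluator_reward(predicted_scores: list[float], preferred_index: int) -> float:
    # +1 if the response humans preferred gets the unique top score, else -1.
    top = max(predicted_scores)
    is_correct = (
        predicted_scores[preferred_index] == top
        and predicted_scores.count(top) == 1
    )
    return 1.0 if is_correct else -1.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    # GRPO-style normalization: each sampled evaluation's reward relative to
    # the group mean, scaled by the group standard deviation.
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

# e.g. four sampled evaluations of the same response set, where humans
# preferred response index 2:
rewards = [evaluator_reward(s, preferred_index=2) for s in (
    [3.0, 5.0, 9.0], [8.0, 6.0, 7.0], [2.0, 2.0, 9.5], [9.0, 9.0, 9.0],
)]
print(rewards)                             # [1.0, -1.0, 1.0, -1.0]
print(group_relative_advantages(rewards))  # [1.0, -1.0, 1.0, -1.0]
```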

Finally, they introduce Inference-Time Sampling and Voting to further enhance the robustness of the reward model by sampling multiple outputs and aggregating them through a voting mechanism. Essentially - have a language model generate multiple responses, have the trained reward model judge each response, and choose the highest-rated one at inference time to improve performance. It feels like a great improvement on self-consistency mechanisms.
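A minimal sketch of that sampling + voting loop, under my own assumptions: `judge` stands in for a call to the trained reward model returning a scalar for one response, and averaging the k samples is the "voting" step.

```python
import random

def voted_score(judge, question, response, k=8):
    # Sample the (stochastic) reward model k times and aggregate by averaging.
    return sum(judge(question, response) for _ in range(k)) / k

def best_response(judge, question, responses, k=8):
    # Score every candidate with the voted reward; keep the highest-rated one.
    return max(responses, key=lambda r: voted_score(judge, question, r, k))

# Toy usage with a noisy stand-in judge:
def noisy_judge(question, response):
    return len(response) / 10 + random.random()  # placeholder scoring

candidates = ["short answer", "a somewhat longer, more detailed answer"]
print(best_response(noisy_judge, "What is GRPO?", candidates))
```

Spending more compute on more samples and a bigger vote is exactly the inference-time scaling knob the paper's title refers to.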

#AI #deepseek