
DeepSeek GRM and SPCT - Complex Domain Rewards

Recently, I gave a talk on several of DeepSeek's innovations, which were as extensive as they were complicated. The particularly clever discovery that best captured my imagination was their development of GRM and SPCT.

Most AI models - be they "simply" instruction-based or reasoning models - tend to focus their training on the coding or mathematics domains. This makes sense; when training, you need to provide some form of signal for backpropagation. Typically, this is done pairwise - A > B amongst some number of generated outputs - or as a scalar value, typically {0, 1} as a pass/fail. Math and coding prompts have a testable, verifiable expected output, and thus we can, at scale and fully automated, determine whether the model is achieving the goal we set for it. For most other domains, it would take an impossible legion of domain experts to review and score outputs during training.
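To make the contrast concrete, here is a minimal sketch (my own illustration, not DeepSeek's code) of how a verifiable domain collapses the reward into an automated pass/fail check:

```python
def verifiable_reward(model_output: str, expected_answer: str) -> float:
    """Return 1.0 if the model's final answer matches the known answer, else 0.0."""
    # Assumes the model ends its response with "Answer: <value>".
    final = model_output.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if final == expected_answer.strip() else 0.0


def pairwise_preference(a: str, b: str, expected: str) -> str:
    """The pairwise signal (A > B) falls out of the same check."""
    ra, rb = verifiable_reward(a, expected), verifiable_reward(b, expected)
    return "A" if ra > rb else ("B" if rb > ra else "tie")


print(verifiable_reward("The sum is clear. Answer: 42", "42"))  # 1.0
```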

Then there's the problem of partial scoring. Even with a scalar value, we struggle to apply anything beyond the pass/fail of 0 or 1. It is an unenviable task to create a reasonably standard or effective scoring rubric for any single complex domain, let alone across the infinite continuum of tasks we expect these models to tackle. A more concrete example from my own field, robotics: if I were training a robot to throw a basketball through a hoop, it is all but impossible to score the robot in any manner beyond pass/fail, both because of the difficulty of judging how far short a throw fell, its arc, and the ball's spin, and because of the sheer difficulty of physically tracking the ball in the first place.

To solve this, DeepSeek sought to create a model (Generative Reward Model, GRM) designed to handle the generation of critiques - a more sophisticated take on Actor/Critic.

We need a methodology for determining what is important when rating a response, and any criterion we decide upon is going to be wholly unique to a given prompt/response pair. A language model is fine-tuned to generate a set of specific criteria - 2-4 categories to describe, analyze, and assess - highly specific to the user's likely expectations given their prompt. These criteria are often complex, and impossible to rate numerically without something at least approximating a workable "facade" of intuition. Nor do all criteria factor equally into a rating: alongside the criteria themselves, the model assigns a weight to each one.
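As a rough illustration of the kind of structure the GRM emits, here is a minimal sketch; the field names and example criteria are my own invention, not DeepSeek's actual schema:

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str         # e.g. "factual accuracy"
    description: str  # what this criterion means for this specific prompt
    weight: float     # relative importance, also assigned by the GRM


# What the GRM might produce for "Summarize this legal contract for me":
criteria = [
    Criterion("completeness", "covers every clause that affects the client", 0.5),
    Criterion("plain language", "avoids legalese the client would not know", 0.3),
    Criterion("brevity", "stays under one page", 0.2),
]

assert abs(sum(c.weight for c in criteria) - 1.0) < 1e-9
```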

The model then produces an analysis - a natural language review of the response against the generated criteria. I suspect this works in the same manner as when I prompt models to critique their reasoning prior to generating an answer in order to improve performance. My current understanding (intuition?) is that the self-attention heads are "primed" to act within a high-dimensional space adjacent to the correct solution.

Finally, the model assigns a numerical score to each criterion, paired with a justification for that score. The overall score is then the weighted sum of the per-criterion scores:

$$\sum_{i} w_i s_i$$

where $w_i$ is the weight and $s_i$ the score assigned to criterion $i$.
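In code, the aggregation is just that weighted sum; a minimal sketch, assuming the GRM has already emitted per-criterion weights and scores:

```python
def overall_score(weights: list[float], scores: list[float]) -> float:
    """Compute sum_i w_i * s_i over the GRM's criteria."""
    assert len(weights) == len(scores)
    return sum(w * s for w, s in zip(weights, scores))


# Three criteria scored 8, 6, and 9 (out of 10) with weights 0.5, 0.3, 0.2:
print(overall_score([0.5, 0.3, 0.2], [8, 6, 9]))  # ≈ 7.6
```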

Once the model is trained, we can use it to produce singular or paired pointwise reward values for training other models. DeepSeek plugged this into the GRPO training approach they utilized for their R1 training.
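For a sense of how the GRM's scalar outputs could feed GRPO, here is a simplified sketch of the group-relative advantage GRPO computes over a batch of sampled responses; the clipping and KL terms of the full objective are omitted:

```python
import statistics


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """rewards: one GRM-derived overall score per sampled response to a prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against identical rewards
    return [(r - mean) / std for r in rewards]


# Four sampled responses to one prompt, each scored by the GRM:
print(group_relative_advantages([7.6, 5.2, 8.1, 6.0]))
```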

So how do we train such a model? Enter SPCT (Self-Principled Critique Tuning). First, rejection fine-tuning (RFT) occurs, teaching the model to generate structured critiques in the format we desire. We generate multiple critiques per sample during training; any samples marked as "too easy" - i.e., every generated critique already agrees on the correct preference - as well as samples that do not match human preference, are rejected and not used in training.
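A sketch of that rejection step as I understand it; the data layout is my own simplification:

```python
def filter_rft_samples(samples: list[dict]) -> list[dict]:
    """Each sample: {"prompt_id": ..., "critique": ..., "predicted": "A"/"B",
    "preferred": "A"/"B"}. Returns the critiques kept for fine-tuning."""
    by_prompt: dict = {}
    for s in samples:
        by_prompt.setdefault(s["prompt_id"], []).append(s)

    kept = []
    for group in by_prompt.values():
        correct = [s for s in group if s["predicted"] == s["preferred"]]
        if len(correct) == len(group):
            continue  # every critique already gets it right: "too easy", skip the prompt
        kept.extend(correct)  # drop critiques that disagree with human preference
    return kept
```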

In the second phase, the researchers utilized rule-based reinforcement learning to optimize the model, allowing it to generate critiques for response pairs with a known preferred option. The success (or failure) in selecting the known option is then fed back to the reward model, again using GRPO.
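The rule-based reward reduces to a simple check of whether the GRM's critique ended up preferring the human-preferred response; a minimal sketch, assuming a pairwise setup:

```python
def rule_based_reward(grm_scores: dict[str, float], preferred: str) -> float:
    """grm_scores: the GRM's overall score per candidate, e.g. {"A": 7.6, "B": 5.2}.
    preferred: which candidate the humans labeled as better."""
    chosen = max(grm_scores, key=grm_scores.get)
    return 1.0 if chosen == preferred else -1.0


print(rule_based_reward({"A": 7.6, "B": 5.2}, preferred="A"))  # 1.0
```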

A further model is introduced: a meta reward model, a critic for our critic! DeepSeek's Meta RM is trained to intake the prompt, response, and critiques, outputting a simple scalar reward value rating each critique (wherein we assume we have generated several per response). Each critique is judged on its coherency and the justification of its scoring. On top of providing a latent explanation layer, the Meta RM can act as a self-consistency filter. Given N critiques judging a single trajectory, we can use the Meta RM to filter out poorly performing critiques (i.e., ones that fail to properly encapsulate the qualities we are looking for) so that an average of the surviving critique scores better represents the quality of our responses. Paired with multiple generated responses per prompt, you'll see an (admittedly compute-intensive) improvement in the performance and accuracy of your AI.
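A sketch of the Meta RM acting as that self-consistency filter at inference time; the `grm` and `meta_rm` callables here are stand-ins for the actual models:

```python
def filtered_score(prompt: str, response: str, grm, meta_rm,
                   n: int = 8, keep_top: int = 4) -> float:
    """grm(prompt, response) -> (critique_text, overall_score)
    meta_rm(prompt, response, critique_text) -> scalar rating of the critique"""
    critiques = [grm(prompt, response) for _ in range(n)]
    ranked = sorted(critiques,
                    key=lambda c: meta_rm(prompt, response, c[0]),
                    reverse=True)
    survivors = ranked[:keep_top]
    # Average only the scores whose critiques the Meta RM trusts.
    return sum(score for _, score in survivors) / len(survivors)
```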

Once the GRM is trained, you can use it to train multiple base models. I suspect that once DeepSeek releases their GRM, which might be powering the soon-to-be-released DeepSeek-R2, we will see an explosion of fine-tuned domain-specific models. It is likely we will see an increase in LLM performance in domains that have, as of yet, not been dominated by AI.

I have been curious whether the same style of GRM output could be produced via few-shot prompting of an LLM-as-a-judge. I very much like the idea of utilizing competing GRMs that argue over suggested fixes, resulting in a self-improving loop. I'm tempted to dive into building this with arkaine, but have too many projects on the docket at this moment. The price of being in such a rapidly advancing field, I suppose.