This work is authored by Haoxiang Wang*, Wei Xiong*, Tengyang Xie, Han Zhao, Tong Zhang (* indicates equal contribution)


Abstract

Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models (LLMs) with human preferences. The RLHF process typically starts by training a reward model (RM) on human preference data. Conventional RMs are trained on pairs of responses to the same user request, with relative ratings indicating which response humans prefer. The trained RM serves as a proxy for human preferences. However, due to the black-box nature of RMs, their outputs lack interpretability: humans cannot intuitively understand why an RM considers a response good or bad. As RMs act as proxies for human preferences, we believe they should be human-interpretable, both to ensure that their internal decision processes are consistent with human preferences and to prevent reward hacking in LLM alignment. To build RMs with interpretable preferences, we propose a two-stage approach: i) train an Absolute-Rating Multi-Objective Reward Model (ArmoRM) with multi-dimensional absolute-rating data, each dimension corresponding to a human-interpretable objective (e.g., honesty, verbosity, safety); ii) employ a Mixture-of-Experts (MoE) strategy with a gating network that automatically selects the most suitable reward objectives based on the context. We efficiently trained an ArmoRM with Llama3-8B and a gating network consisting of a shallow MLP on top of the ArmoRM. Our final reward model, ArmoRM-Llama3-8B-v0.1, ranks first on the leaderboard of RewardBench, a benchmark evaluating RMs for language modeling, surpassing the LLM-as-a-judge approach with GPT-4 and the common Bradley-Terry modeling approach with Llama3-8B or Yi-34B by a notable margin.

Preliminaries

RLHF Pipeline

The standard RLHF-based alignment pipeline, as established by the foundational InstructGPT work (the algorithmic framework behind ChatGPT), involves three main stages:

[Figure: The three-stage RLHF pipeline of InstructGPT]

  1. Supervised Fine-Tuning (SFT): This initial stage involves training the language model on a dataset of human-written responses to align the model’s outputs with human expectations. This stage sets a baseline for the model’s understanding of tasks and appropriate responses.
  2. Reward Modeling: In this stage, a reward model is trained to predict the rewards (usually derived from human preferences or ratings) associated with different outputs. This reward model is critical for evaluating the quality of model-generated responses and serves as the foundation for subsequent policy optimization. For more background on the common Bradley-Terry approach to reward modeling, you can read our previous blog post.
  3. Policy Optimization: During this final stage, the model is fine-tuned to maximize the expected rewards as estimated by the reward model, thereby aligning the model’s outputs even more closely with human preferences.

The term RLHF usually refers to the latter two stages, excluding the SFT stage. RLHF can be understood as an approach to preference tuning, which encourages the model to output human-preferred responses for a wide range of prompts. While there is a line of work bypassing the reward modeling stage with offline direct preference learning algorithms such as SLiC, DPO, and IPO, recent works show that leveraging an external reward model to iteratively label on-policy data also helps improve model performance.
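As a quick refresher on the Bradley-Terry approach mentioned above, here is a minimal PyTorch sketch of the pairwise loss that a conventional BT reward model minimizes (the function and variable names are illustrative, not taken from any specific codebase):

import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected),
    # where the rewards are scalar scores produced by the reward model.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with dummy scalar rewards for a batch of two preference pairs
loss = bradley_terry_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))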

The Need for Interpretable Reward Models

Reward models (RMs) play a crucial role in the alignment of LLMs using RLHF. They provide a scalable and efficient way to capture human preferences and guide the optimization of LLM policies. However, common RMs, such as the most popular Bradley-Terry RMs, are typically black-box models that output scores or preferences without providing human-interpretable explanations.

[Figure: Illustration of Bradley-Terry reward modeling]

Furthermore, when applying RLHF for LLM alignment, the phenomenon of reward hacking is widely observed, where the aligned LLMs generate high-reward responses (rated by the RM) that do not align with actual human preferences. A notable example of this is the verbosity bias, where aligned LLMs produce longer-than-necessary responses because the RM favors length, regardless of quality. The following figure from Singhal et al., 2023 illustrates how a preference for longer responses by the RM leads to more verbose outputs from the corresponding LLM.

[Figure: From Singhal et al., 2023 — a reward model's preference for longer responses leads to more verbose outputs from the aligned LLM]

How can we mitigate the reward hacking issue? We believe one solution is to make the reward model more interpretable and debuggable. Let’s continue considering the verbosity bias example. Suppose the RM’s output is interpretable, explaining that it assigns a high score to a response due to two factors: 40% for its helpfulness and 60% for its length. In this case, we can see that the RM has a verbosity bias. Furthermore, if the RM is debuggable, we could adjust its decision-making process to base its scoring 100% on helpfulness, regardless of response length, thus mitigating the verbosity bias.

Enhancing the interpretability of RMs also allows humans to verify whether RMs have similar internal decision processes to humans when acting as proxies for human preferences. We believe that this thorough human verification process could ensure that RMs are deeply and comprehensively consistent with human values and preferences, making RM-aligned LLMs more reliable and robust.

Multi-Objective Reward Modeling Meets Mixture-of-Experts

Our proposed approach consists of two stages: i) Multi-Objective Reward Modeling and ii) Mixture-of-Experts Aggregation of Reward Objectives, which are explained in detail below.

Stage-1: Multi-Objective Reward Modeling

Most existing reward models for LLM alignment are trained with the Bradley-Terry loss on pairwise data with annotated preferences (please check out our previous blog for background and more details), using the same approach as InstructGPT. The pairwise preference annotations are essentially binary labels, e.g., $\{0, 1\}$, indicating which response is preferred by the annotator. We call them relative ratings here. However, the relative ratings in some recent high-quality datasets are converted from absolute ratings. For instance, the UltraFeedback dataset is curated with 5-objective absolute ratings: Overall Score, Instruction Following, Truthfulness, Honesty, and Helpfulness (each objective has 5 distinct ratings based on pre-defined rubrics). The dataset is then binarized into pairwise comparisons, using either the Overall Score or the average score of the remaining 4 objectives, for training reward models or DPO.

The original ratings are fine-grained, as each objective has discrete integer rating scores (e.g., 1, 2, 3, 4, 5). The binarization process, however, discards some of this fine-grained information. For example, a pair of responses with scores 1:5 is labeled in the same way as another pair with scores 2:3. There is no evidence that discarding the fine-grained preference information is beneficial; hence, we would like to retain all of it for reward modeling.
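As a toy illustration (with made-up scores) of how binarization discards the rating margin:

# Two hypothetical response pairs with absolute ratings on a 1-5 scale
pair_a = (1, 5)   # large quality gap between the two responses
pair_b = (2, 3)   # small quality gap

def binarize(scores):
    # Both pairs collapse to the same binary label: "the second response is preferred"
    return 1 if scores[1] > scores[0] else 0

print(binarize(pair_a), binarize(pair_b))  # 1 1 -- the 1:5 vs. 2:3 margin is lost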

Absolute-Rating Multi-Objective Reward Model (ArmoRM)

As the training examples come with multi-objective ratings, the straightforward approach for learning with these ratings is multi-objective regression, which is also adopted in Directional Preference Alignment (DPA) and HelpSteer. Here, we briefly introduce the training procedure.

[Figure: ArmoRM — an LLM backbone with a multi-objective regression head]

We consider each example to consist of a prompt $x$ (including contexts from previous conversation turns), a response $y$, and a $k$-dimensional rating vector $r\in \mathbb{R}^{k}$, where each dimension corresponds to a reward objective such as helpfulness or truthfulness. We take a pre-trained decoder-only LLM without the original output linear layer as the feature extractor $f_\theta$, pass $(x,y)$ through the decoder layers, and use the hidden state of the final decoder layer on the last token as a $d$-dimensional feature. We then attach a new linear regression layer $w\in \mathbb{R}^{d \times k}$ on top of $f_\theta$, which outputs a $k$-dimensional rating prediction. The model can be straightforwardly trained with a regression loss:

$$ \min_{\theta, w} \mathbb{E}_ {x,y,r} || w^\top f_\theta(x,y) - r ||_2^2 $$

[Figure: Multi-objective regression — a linear layer maps the last-token feature to $k$ rating predictions]
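For concreteness, below is a minimal PyTorch sketch of this architecture and loss; the class and variable names (e.g., MultiObjectiveRM, num_objectives) are illustrative, and the actual implementation may differ in details such as padding handling:

import torch
import torch.nn as nn
from transformers import AutoModel

class MultiObjectiveRM(nn.Module):
    # Decoder-only LLM backbone f_theta plus a linear regression head w in R^{d x k}
    def __init__(self, backbone_name: str, num_objectives: int):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)  # no LM output layer
        d = self.backbone.config.hidden_size
        self.regression_head = nn.Linear(d, num_objectives, bias=False)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        # Feature: last decoder layer's hidden state at the last non-padded token
        last_idx = attention_mask.sum(dim=1) - 1
        feats = out.last_hidden_state[torch.arange(input_ids.size(0)), last_idx]  # (batch, d)
        return self.regression_head(feats)  # (batch, k) predicted ratings

def regression_loss(pred_ratings, target_ratings):
    # Squared-error regression loss against the k-dimensional rating vector r
    return ((pred_ratings - target_ratings) ** 2).sum(dim=-1).mean()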

Implementation of ArmoRM

We provide the implementation details of our ArmoRM model, including the architecture, parameter initialization, training procedure, and datasets used.

  • Base Model: Llama-3 8B
  • Parameter Initialization: FsfairX-LLaMA3-RM-v0.1, a Bradley-Terry reward model trained from Llama3-8B-Instruct, using our RLHFlow codebase for Reward Modeling.
  • Training: Linear Probing (training the newly initialized linear layer only while keeping all transformer layers frozen)
  • Datasets: HelpSteer, UltraFeedback, BeaverTails, Argilla-Capybara, Argilla-Math-Preferences, CodeUltraFeedback, Prometheus, Argilla-OpenOrca
  • Objectives: We have $k=19$ reward objectives in total obtained from the datasets:
    • HelpSteer: helpsteer-helpfulness, helpsteer-correctness, helpsteer-coherence, helpsteer-complexity, helpsteer-verbosity
    • UltraFeedback: ultrafeedback-overall_score, ultrafeedback-instruction_following, ultrafeedback-truthfulness, ultrafeedback-honesty, ultrafeedback-helpfulness
    • BeaverTails: beavertails-is_safe
    • CodeUltraFeedback: code-complexity, code-style, code-explanation, code-instruction-following, code-readability
    • Prometheus: prometheus-score
    • Argilla-Capybara: argilla-overall_quality
    • Argilla-OpenOrca: argilla-judge_lm
  • Data Processing: When merging multiple datasets with absolute ratings (e.g., UltraFeedback and HelpSteer), we observe some issues with the data. Here, we present the issues and our approach to tackle them:
    1. Different Rating Scales: Different datasets may have different scales for the ratings. For instance, HelpSteer has a rating scale of 0-4, while UltraFeedback’s is 1-10. We linearly transform all ratings to make them between 0 and 1. For BeaverTails with True/False ratings (indicating safe or unsafe), we treat True as 1 and False as 0.
    2. Similar Objectives: There are some very similar objectives from different datasets. For example, the Helpfulness objective appears in both HelpSteer and UltraFeedback, and the Correctness objective of HelpSteer is quite similar to the Truthfulness objective of UltraFeedback. After carefully examining the datasets, we decided to treat similar objectives as separate objectives, as they are rated by different judges following different rubrics. For instance, data from HelpSteer are rated by 200 U.S.-based human annotators following customized rubrics, while UltraFeedback data are labeled with GPT-4 following another set of rubrics.
    3. Missing Labels of the Merged Dataset: When merging multiple datasets, each example of the merged dataset only has a subset of ratings; for example, each example from HelpSteer only has 5 ratings originating from the HelpSteer dataset, and it does not have ratings for other objectives (e.g., the objectives from UltraFeedback or BeaverTails). Hence, when optimizing the regression loss, we simply ignore the missing rating dimensions of each example and only compute the loss on the remaining dimensions.
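A minimal sketch of the first and third points above (rating normalization and missing-label masking) might look as follows; the function names and tensor layout are assumptions for illustration:

import torch

def normalize_rating(rating: float, lo: float, hi: float) -> float:
    # Linearly map a raw rating from [lo, hi] (e.g., 0-4 for HelpSteer,
    # 1-10 for UltraFeedback) to [0, 1]
    return (rating - lo) / (hi - lo)

def masked_regression_loss(pred, ratings, mask):
    # pred, ratings: (batch, k); mask: (batch, k), 1 where the objective is labeled
    # for this example and 0 where the rating is missing in the merged dataset
    sq_err = ((pred - ratings) ** 2) * mask
    return sq_err.sum() / mask.sum().clamp(min=1)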

Stage-2: Mixture-of-Experts Aggregation of Reward Objectives

An ArmoRM can predict multi-objective rewards for each response. However, the multi-dimensional outputs need to be aggregated to a single dimension for ranking or pairwise comparisons of test examples. A straightforward approach is to take a linear combination of multiple objectives; however, fixed combination coefficients are too rigid for complex application scenarios. For instance, for prompts that could easily trigger unsafe responses, the safety objective should be assigned a large coefficient, as we wish the reward model to rank unsafe responses lower than safe ones. However, for prompts for math problem assistance, the safety objective becomes almost useless, and the helpfulness-related objectives should be the primary focus.

ArmoRM with Mixture-of-Experts (Armo-MoE)

With the insight mentioned above, we propose an MoE-style aggregation of reward objectives, conditioned on the prompt $x$. On the architecture level, we just need to follow the common MoE practice and add a gating layer, $g_\phi : \mathbb{R}^d \mapsto \{v\in \mathbb{R}^{k}\mid v_i\geq 0 ~\mathrm{and}~ \textstyle\sum_i v_i = 1 \}$, that outputs non-negative coefficients (summing to 1) for the reward objectives based on the feature extracted from the prompt, $f_\theta(x) \in \mathbb{R}^d$, i.e., the hidden state on the last token of $x$. Notice that $f_\theta(x)$ is obtained for free during the forward pass of $f_\theta(x,y)$, making the pipeline inference-efficient.

[Figure: ArmoRM-MoE — a gating network over the prompt feature produces mixing coefficients for the multi-objective rewards]

The gating layer $g_\phi$ can simply be a shallow MLP (i.e., a fully-connected network) that takes the prompt feature $f_\theta(x)$ and outputs a $k$-dimensional vector, followed by a softmax function to ensure the elements of the output vector are non-negative and sum to 1.
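A minimal sketch of such a gating layer is shown below (the hidden width and depth are illustrative choices, not the exact configuration we used):

import torch
import torch.nn as nn

class GatingNetwork(nn.Module):
    # Shallow MLP mapping the prompt feature f_theta(x) in R^d to simplex weights in R^k
    def __init__(self, d: int, k: int, hidden_dim: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, k),
        )

    def forward(self, prompt_feature: torch.Tensor) -> torch.Tensor:
        # Softmax ensures non-negative coefficients that sum to 1
        return torch.softmax(self.mlp(prompt_feature), dim=-1)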

However, most reward objectives are highly correlated with verbosity, which indicates a strong verbosity bias. Using non-negative gating coefficients would make the final output inherit the bias. To resolve the issue, we adjust each reward objective, $r_i$, with a penalty using the verbosity reward objective,

$$ r_i' \leftarrow r_i - \lambda_i r_{\mathrm{verbose}} $$

where the penalty coefficient $\lambda_i$ is chosen such that, for a chosen correlation metric (e.g., the Pearson or Spearman correlation coefficient) and a reference data distribution $\mathcal D$,

$$ \mathbb{E}_{\mathcal D}\,\mathrm{Corr}(r_i', r_{\mathrm{verbose}}) = 0 $$

The adjusted reward vector is denoted as $r’\in \mathbb{R}^k$.
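For the Pearson case, such a $\lambda_i$ has a closed form (the ordinary least-squares coefficient of $r_i$ on $r_{\mathrm{verbose}}$); a minimal sketch, assuming the rewards over a reference set are collected into 1-D tensors, is:

import torch

def verbosity_penalty_coeff(r_i: torch.Tensor, r_verbose: torch.Tensor) -> torch.Tensor:
    # Choose lambda_i = Cov(r_i, r_verbose) / Var(r_verbose), so that
    # r_i' = r_i - lambda_i * r_verbose has zero Pearson correlation with r_verbose
    r_i_centered = r_i - r_i.mean()
    r_v_centered = r_verbose - r_verbose.mean()
    return (r_i_centered * r_v_centered).sum() / (r_v_centered * r_v_centered).sum()

# Usage on rewards predicted over a reference distribution D:
# lam = verbosity_penalty_coeff(rewards[:, i], rewards[:, verbosity_idx])
# adjusted_r_i = rewards[:, i] - lam * rewards[:, verbosity_idx]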

Finally, we multiply the gating coefficients with the adjusted multi-objective rewards to obtain a scalar preference score $R$ for the response $y$ given the prompt $x$,

$$ R = g_\phi(f_\theta(x))^\top r' $$

To train the gating layer, we freeze the parameters of the backbone and the regression layer, and only train the gating layer with the Bradley-Terry loss,

$$ \min_\phi \mathbb{E} \left[ -\log \frac{\exp(R_{\mathrm{chosen}})}{\exp(R_{\mathrm{chosen}})+\exp(R_{\mathrm{rejected}})} \right] $$

where $R_{\mathrm{chosen}}$ and $R_{\mathrm{rejected}}$ are the preference scores for the chosen and rejected responses in each pairwise example, $(x, y_{\mathrm{chosen}}, y_{\mathrm{rejected}})$.
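A minimal sketch of this training step, with the backbone and regression layer frozen and only the gating network's parameters updated, could look like the following (the batch format and helper names are assumptions for illustration; note that the loss above equals $-\log\sigma(R_{\mathrm{chosen}} - R_{\mathrm{rejected}})$):

import torch
import torch.nn.functional as F

def gating_bt_loss(gating_net, prompt_feats, r_adj_chosen, r_adj_rejected):
    # prompt_feats: (batch, d) prompt features f_theta(x), precomputed with the frozen backbone
    # r_adj_chosen / r_adj_rejected: (batch, k) verbosity-adjusted rewards r' of the two responses
    coeffs = gating_net(prompt_feats)                       # (batch, k), on the simplex
    score_chosen = (coeffs * r_adj_chosen).sum(dim=-1)      # R_chosen
    score_rejected = (coeffs * r_adj_rejected).sum(dim=-1)  # R_rejected
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# optimizer = torch.optim.AdamW(gating_net.parameters(), lr=1e-4)  # only gating params update
# loss = gating_bt_loss(gating_net, prompt_feats, r_adj_chosen, r_adj_rejected)
# loss.backward(); optimizer.step(); optimizer.zero_grad()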

Implementation of ArmoRM-MoE

The gating layer is trained on top of the ArmoRM obtained from Stage-1, with the backbone $f_\theta$ and the regression layer $w$ kept frozen; only the gating layer's parameters are updated with the Bradley-Terry loss above.

Empirical Results: SoTA on Reward-Bench

We present the evaluation results of our ArmoRM model on the Reward-Bench benchmark, which consists of a diverse set of tasks designed to assess the performance of reward models for LLM alignment. The table below compares the performance of our model with other state-of-the-art approaches, demonstrating the superiority of our method across various domains.

| Model | Base Model | Method | Score | Chat | Chat Hard | Safety | Reasoning | Prior Sets (0.5 weight) |
|---|---|---|---|---|---|---|---|---|
| ArmoRM-Llama3-8B-v0.1 | Llama-3 8B | ArmoRM + MoE | 89.0 | 96.9 | 76.8 | 92.2 | 97.3 | 74.3 |
| Cohere May 2024 | Unknown | Unknown | 88.3 | 96.4 | 71.3 | 92.7 | 97.7 | 78.2 |
| pair-preference-model | Llama-3 8B | SLiC-HF | 85.7 | 98.3 | 65.8 | 89.7 | 94.7 | 74.6 |
| GPT-4 Turbo (0125 version) | GPT-4 Turbo | LLM-as-a-Judge | 84.3 | 95.3 | 74.3 | 87.2 | 86.9 | 70.9 |
| FsfairX-LLaMA3-RM-v0.1 | Llama-3 8B | Bradley-Terry | 83.6 | 99.4 | 65.1 | 87.8 | 86.4 | 74.9 |
| Starling-RM-34B | Yi-34B | Bradley-Terry | 81.4 | 96.9 | 57.2 | 88.2 | 88.5 | 71.4 |

Several key observations can be made from these results:

  1. Our ArmoRM-Llama3-8B-v0.1 model significantly outperforms FsfairX-LLaMA3-RM-v0.1, which serves as the initialization of our model. This demonstrates the effectiveness of our ArmoRM design and the MoE gating mechanism in improving the performance of reward models.
  2. Our model also outperforms the LLM-as-a-Judge approach with a GPT-4 judge by a considerable margin, indicating that our model could be used as a replacement for GPT-4 in many annotation jobs or even serve as a judge model for benchmarks (e.g., MT-Bench, AlpacaEval-2.0, ArenaHard).
  3. The Cohere May 2024 model, developed by Cohere AI, is a closed model with unknown size and training details. Despite the lack of information about this model, our ArmoRM-Llama3-8B-v0.1 still manages to outperform it on the Reward-Bench benchmark.

Usage Example (Code Demo)

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
device = "cuda"
path = "RLHFlow/ArmoRM-Llama3-8B-v0.1"
model = AutoModelForSequenceClassification.from_pretrained(path, device_map=device, 
                               trust_remote_code=True, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=True)
# We load a random sample from the validation set of the HelpSteer dataset
prompt = 'What are some synonyms for the word "beautiful"?'
response = "Nicely, Beautifully, Handsome, Stunning, Wonderful, Gorgeous, Pretty, Stunning, Elegant"
messages = [{"role": "user", "content": prompt},
           {"role": "assistant", "content": response}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)
with torch.no_grad():
   output = model(input_ids)
   # Multi-objective rewards for the response
   multi_obj_rewards = output.rewards.cpu().float() 
   # The gating layer's output is conditioned on the prompt
   gating_output = output.gating_output.cpu().float()
   # The preference score for the response, aggregated from the 
   # multi-objective rewards with the gating layer
   preference_score = output.score.cpu().float()  
# We apply a transformation matrix to the multi-objective rewards
# before multiplying with the gating layer's output. This mainly aims
# at reducing the verbosity bias of the original reward objectives
obj_transform = model.reward_transform_matrix.data.cpu().float()
# The final coefficients assigned to each reward objective
multi_obj_coeffs = gating_output @ obj_transform.T
# The preference score is the linear combination of the multi-objective rewards with
# the multi-objective coefficients, which can be verified by the following assertion
assert torch.isclose(torch.sum(multi_obj_rewards * multi_obj_coeffs, dim=1), preference_score, atol=1e-3) 
# Find the top-K reward objectives with coefficients of the highest magnitude
K = 3
top_obj_dims = torch.argsort(torch.abs(multi_obj_coeffs), dim=1, descending=True,)[:, :K]
top_obj_coeffs = torch.gather(multi_obj_coeffs, dim=1, index=top_obj_dims)

# The attributes of the 19 reward objectives
attributes = ['helpsteer-helpfulness','helpsteer-correctness','helpsteer-coherence',
   'helpsteer-complexity','helpsteer-verbosity','ultrafeedback-overall_score',
   'ultrafeedback-instruction_following', 'ultrafeedback-truthfulness',
   'ultrafeedback-honesty','ultrafeedback-helpfulness','beavertails-is_safe',
   'prometheus-score','argilla-overall_quality','argilla-judge_lm','code-complexity',
   'code-style','code-explanation','code-instruction-following','code-readability']

example_index = 0
for i in range(K):
   attribute = attributes[top_obj_dims[example_index, i].item()]
   coeff = top_obj_coeffs[example_index, i].item()
   print(f"{attribute}: {round(coeff,5)}")
# code-complexity: 0.19922
# helpsteer-verbosity: -0.10864
# ultrafeedback-instruction_following: 0.07861

# The actual rewards of this example from the HelpSteer dataset
# are [3,3,4,2,2] for the five helpsteer objectives: 
# helpfulness, correctness, coherence, complexity, verbosity
# We can linearly transform our predicted rewards to the 
# original reward space to compare with the ground truth
helpsteer_rewards_pred = multi_obj_rewards[0, :5] * 5 - 0.5
print(helpsteer_rewards_pred)
# [2.78125   2.859375  3.484375  1.3847656 1.296875 ]

Citation

If you find this work useful for your research, please consider citing:

@article{ArmoRM,
      title={Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts}, 
      author={Haoxiang Wang and Wei Xiong and Tengyang Xie and Han Zhao and Tong Zhang},
      journal={arXiv preprint arXiv:2406.12845},
      year={2024},
}

@inproceedings{wang2024arithmetic,
      title={Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards}, 
      author={Haoxiang Wang and Yong Lin and Wei Xiong and Rui Yang and Shizhe Diao and Shuang Qiu and Han Zhao and Tong Zhang},
      year={2024},
      booktitle={ACL},
}

The second entry, “Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards”, is another recent work of ours that trained a multi-objective reward model and adopted it for LLM alignment, which motivated us to develop the current work.