Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts
An interpretable reward modeling approach.
A guidebook for LLM alignment.
This is the recipe used in the RLHFlow/RLHF-Reward-Modeling repository to train reward models for RLHF.
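For orientation, the sketch below shows the standard pairwise (Bradley-Terry) reward-modeling objective that underlies this kind of training: a scalar reward head scores a chosen and a rejected response, and the loss pushes the chosen score above the rejected one. It is a minimal illustration, not the repository's actual training code; the backbone model name and tokenizer settings are assumptions.

```python
# Minimal sketch of a pairwise (Bradley-Terry) reward-modeling loss.
# The backbone model name below is an illustrative assumption, not the
# repository's actual configuration.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumption: any decoder backbone works
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # ensure padding is defined for batching

# A sequence-classification head with a single label acts as a scalar reward head.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id


def pairwise_loss(chosen_texts, rejected_texts):
    """Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    chosen = tokenizer(chosen_texts, return_tensors="pt", padding=True, truncation=True)
    rejected = tokenizer(rejected_texts, return_tensors="pt", padding=True, truncation=True)
    r_chosen = model(**chosen).logits.squeeze(-1)      # scalar reward per chosen response
    r_rejected = model(**rejected).logits.squeeze(-1)  # scalar reward per rejected response
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```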