Interpreting Language Model Preferences Through the Lens of Decision Trees
A decision-tree perspective for interpreting LLM preference mechanisms.
An interpretable reward modeling approach.
A guidebook for LLM alignment.
This is the training recipe from the RLHFlow/RLHF-Reward-Modeling repository, used to train reward models for RLHF.
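To make the decision-tree framing concrete, below is a minimal illustrative sketch, not the repository's actual pipeline: it fits a shallow decision tree on per-attribute reward-score differences between a chosen and a rejected response to predict human preference, so the tree's splits read as interpretable preference rules. The attribute names and the synthetic data are assumptions made for this example.

```python
# Minimal sketch: a shallow decision tree over attribute-score differences
# as an interpretable preference model. Attribute names and data are
# illustrative assumptions, not the repository's training setup.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

ATTRIBUTES = ["helpfulness", "correctness", "coherence", "verbosity"]  # assumed

rng = np.random.default_rng(0)
n_pairs = 1000
# Each row holds attribute-score differences (chosen minus rejected) for one pair.
score_diff = rng.normal(size=(n_pairs, len(ATTRIBUTES)))
# Toy preference labels, driven mostly by the helpfulness and correctness gaps.
label = (0.8 * score_diff[:, 0] + 0.5 * score_diff[:, 1]
         + 0.1 * rng.normal(size=n_pairs)) > 0

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(score_diff, label)

# The learned splits read as human-interpretable rules, e.g.
# "if the helpfulness gap is positive, prefer the first response".
print(export_text(tree, feature_names=ATTRIBUTES))
```

Printing the tree with `export_text` is what makes this framing interpretable: each root-to-leaf path is an explicit rule over attribute gaps rather than an opaque scalar score.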