Interpreting Language Model Preferences Through the Lens of Decision Trees
A decision-tree perspective for interpreting LLM preference mechanisms.
An interpretable reward modeling approach.
A guidebook for LLM alignment.
This is the training recipe from the RLHFlow/RLHF-Reward-Modeling repository, used to train reward models for RLHF.
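To make the decision-tree framing concrete, below is a minimal illustrative sketch, not the repository's actual pipeline: it fits a shallow decision tree on per-attribute reward-score differences between a chosen and a rejected response to predict human preference, so the tree's splits read as interpretable preference rules. The attribute names and the synthetic data are assumptions made for this example.

```python
# Minimal sketch: a shallow decision tree over attribute-score differences
# as an interpretable preference model. Attribute names and data are
# illustrative assumptions, not the repository's training setup.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

ATTRIBUTES = ["helpfulness", "correctness", "coherence", "verbosity"]  # assumed

rng = np.random.default_rng(0)
n_pairs = 1000
# Each row holds attribute-score differences (chosen minus rejected) for one pair.
score_diff = rng.normal(size=(n_pairs, len(ATTRIBUTES)))
# Toy preference labels, driven mostly by the helpfulness and correctness gaps.
label = (0.8 * score_diff[:, 0] + 0.5 * score_diff[:, 1]
         + 0.1 * rng.normal(size=n_pairs)) > 0

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(score_diff, label)

# The learned splits read as human-interpretable rules, e.g.
# "if the helpfulness gap is positive, prefer the first response".
print(export_text(tree, feature_names=ATTRIBUTES))
```

Printing the tree with `export_text` is what makes this framing interpretable: each root-to-leaf path is an explicit rule over attribute gaps rather than an opaque scalar score.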