Hacker News

Reinforcement Learning from Human Feedback

87 points | rlhfbook.com | onurkanbkrc | 9 hrs

https://arxiv.org/abs/2504.12501