
Wanqiao Xu
@wanqiao_xu
PhD student @stanford RL Group 🌲formerly @UMich Math 〽️ interested in RL and Finetuning LLM | Previously intern @MetaAI @MSFTResearch
ID: 943661690993885184
http://wanqiaox.github.io 21-12-2017 01:57:34
77 Tweets
211 Followers
401 Following

Prof. Anima Anandkumar For a simple data-generating process (not natural language), we've seen that RLHF moves between collapsing to a single most-preferred response and sticking to the supervised pre-training response distribution, depending on the KL regularization strength: arxiv.org/abs/2305.11455
