Paul Gölz (Mastodon in bio) (@paulgoelz)'s Twitter Profile
Paul Gölz (Mastodon in bio)

@paulgoelz

I think about democracy from a computer science perspective. he/him. Since this site is going downhill, my Mastodon: econtwitter.net/@goelz

ID: 919732722725281792

Website: https://paulgoelz.de · Joined: 16-10-2017 01:12:23

3 Tweets

115 Followers

50 Following

Ariel Procaccia (@arielprocaccia):

Computational social choice in action: Our open-source sortition system, Panelot, supported this live citizens' panel selection process organized by the nonprofit Of By For All. (Collaborators: Bailey Flanigan, Paul Goelz and Anupam Gupta.) citizenspanel.us

Ariel Procaccia (@arielprocaccia):

Our paper on fair algorithms for selecting citizens' assemblies, which boasts the nuanced title "Fair Algorithms for Selecting Citizens' Assemblies," was just published (open access) in Nature. The work was led by the amazing Bailey Flanigan and Paul Gölz (Mastodon in bio). nature.com/articles/s4158…

Nika Haghtalab (@nhaghtal):

RLHF fine-tunes to a “mythical user” via aggregated feedback—but what if that user represents no one? Excited to share a new paper with Paul Gölz and Kunhe Yang “Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?” #AIAlignment #PluralisticAI #LLMs

Nika Haghtalab (@nhaghtal):

Different users disagree on how usable, helpful, or ethical a response is; that disagreement is captured by their utilities. A minimal goal for alignment: optimize average utility. Define distortion = (optimal avg utility if you knew users' true utilities) ÷ (avg utility of the aligned policy). Lower is better.
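
In symbols, a sketch of that definition (notation assumed from the thread's description, not quoted from the paper):

```latex
% Distortion of the policy \pi_f produced by an alignment method f.
% u_i(y): user i's utility for response y; \pi ranges over feasible policies.
\[
  \operatorname{dist}(\pi_f)
  = \frac{\displaystyle \max_{\pi}\; \tfrac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{y \sim \pi}\,[u_i(y)]}
         {\displaystyle \tfrac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{y \sim \pi_f}\,[u_i(y)]}
  \;\ge\; 1 \qquad \text{(lower is better).}
\]
```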

Nika Haghtalab (@nhaghtal):

Takeaway 1⃣: There is a fundamental limit. Even with infinite data, no method can beat β/2 distortion, where β is the Bradley–Terry temperature. Pairwise feedback just isn't rich enough to optimize even average utility.
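
For reference, this is the standard Bradley–Terry model the β refers to (our rendering of the textbook model, not a formula taken from the paper): a user with utility function u answers comparisons noisily, and larger β means more utility-faithful answers.

```latex
% Bradley–Terry with temperature beta: probability of preferring y over y'
\[
  \Pr[\, y \succ y' \,]
  = \frac{e^{\beta u(y)}}{e^{\beta u(y)} + e^{\beta u(y')}}
  = \sigma\bigl(\beta\,(u(y) - u(y'))\bigr).
\]
```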

Nika Haghtalab (@nhaghtal):

Takeaway 2⃣: RLHF and DPO can go off the rails: distortion can scale ∝ exp(β) and even become unbounded. Their distortion is also highly sensitive to how the comparison data are sampled: tweak the distribution and performance can degrade significantly.
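
To see the mechanism, here is a toy computation we constructed (the population, utilities, and β are all made up; this is an illustration, not an experiment from the paper): a large majority mildly prefers response A while a small minority has a large stake in B, and fitting a single Bradley–Terry reward to the pooled win rate favors A even though B has much higher average utility. Scaling up the minority's stake grows the distortion without changing the pooled feedback at all, which is the intuition for unboundedness.

```python
import numpy as np

def sigmoid(x):
    # clip to avoid overflow in exp for the extreme utility gaps below
    return 1.0 / (1.0 + np.exp(-np.clip(x, -60.0, 60.0)))

beta = 50.0  # Bradley–Terry temperature (made up for this illustration)

# Toy population: 90% of users mildly prefer response A; 10% have a large stake in B.
frac = np.array([0.9, 0.1])
u_A  = np.array([0.1, 0.0])
u_B  = np.array([0.0, 10.0])

# Average (true) utilities — what alignment "should" optimize.
avg_A, avg_B = frac @ u_A, frac @ u_B

# Aggregated pairwise feedback: each user answers "A vs B" via their own
# Bradley–Terry model; the dataset only records the pooled win rate.
p_A_beats_B = frac @ sigmoid(beta * (u_A - u_B))

# Fitting one BT reward to the pooled win rate gives a reward gap
# beta*(r_A - r_B) = logit(p); its sign decides which response a
# reward-maximizing policy favors.
r_gap = np.log(p_A_beats_B / (1.0 - p_A_beats_B)) / beta
chosen_avg = avg_A if r_gap > 0 else avg_B

print(f"pooled P(A beats B) = {p_A_beats_B:.3f} -> fitted reward favors "
      f"{'A' if r_gap > 0 else 'B'}")
print(f"avg utility: A={avg_A:.2f}, B={avg_B:.2f}; "
      f"distortion = {max(avg_A, avg_B) / chosen_avg:.1f}")
```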

Nika Haghtalab (@nhaghtal):

Takeaway 3⃣: Nash Learning from Human Feedback, a.k.a. maximal lotteries in social choice theory, achieves the minimax-optimal distortion bound, and that guarantee holds regardless of how comparisons are sampled or how you set your regularization. 🥳
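
A minimal sketch of computing a maximal lottery from a pairwise win-rate matrix, via the standard zero-sum-game formulation and a generic LP solver (the matrix and helper function are our illustration, not code from the paper):

```python
import numpy as np
from scipy.optimize import linprog

def maximal_lottery(P):
    """Maximal lottery for win probabilities P[a, b] = Pr(a beats b),
    with P + P.T = 1 off the diagonal. It is the optimal (Nash) strategy
    of the symmetric zero-sum game with margin matrix M = P - P.T."""
    M = P - P.T                      # skew-symmetric margins
    n = M.shape[0]
    # Variables x = (p_1..p_n, v): maximize v s.t. (M^T p)_b >= v for all b
    # and p is a probability distribution. By symmetry the game value is 0.
    c = np.append(np.zeros(n), -1.0)             # linprog minimizes -v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])    # v - (M^T p)_b <= 0
    A_eq = np.append(np.ones(n), 0.0).reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n),
                  A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n + [(None, None)])
    return res.x[:n]

# Condorcet cycle A > B > C > A: the maximal lottery mixes uniformly.
P = np.array([[0.5, 0.6, 0.4],
              [0.4, 0.5, 0.6],
              [0.6, 0.4, 0.5]])
print(maximal_lottery(P))   # ~ [1/3, 1/3, 1/3]
```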

Nika Haghtalab (@nhaghtal):

Beyond alignment: under Bradley–Terry noise, our distortion bounds offer a more meaningful lens for social choice. Our constant distortion bound (constant in the number of responses) for Borda and other rules gets around the pathological examples that give these rules infinite distortion without the BT model.
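
For concreteness, one simple way to read "Borda" in this pairwise setting (our simplification, with a made-up win-rate matrix): score each response by its average win rate against the others and pick the top score, which is essentially what win-rate leaderboards do.

```python
import numpy as np

def borda_winner(P):
    """P[a, b] = Pr(a beats b), with P + P.T = 1 off the diagonal.
    Score each response by its average win rate against the others."""
    n = P.shape[0]
    scores = (P.sum(axis=1) - 0.5) / (n - 1)   # drop the self-comparison
    return int(np.argmax(scores)), scores

# Made-up win rates with a clear top response.
P = np.array([[0.50, 0.70, 0.60],
              [0.30, 0.50, 0.55],
              [0.40, 0.45, 0.50]])
print(borda_winner(P))   # response 0 wins with average win rate 0.65
```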

Nika Haghtalab (@nhaghtal):

We don’t model leaderboards directly, but our results still offer insights: Borda-based methods (e.g., Chatbot Arena) can crown models whose average utility is a factor of β worse than an alternative's. This is fertile ground for future research on leaderboards, utility, and distortion!

Nika Haghtalab (@nhaghtal):

Overall, I'm really interested in better understanding how to make alignment work for real users, not the mythical ones 🦄! Get in touch to share your insights. #AIAlignment #MachineLearning #PluralisticAI