Paul Gölz (Mastodon in bio) (@paulgoelz)'s Twitter Profile
Paul Gölz (Mastodon in bio)

@paulgoelz

I think about democracy from a computer science perspective. he/him. Since this site is going downhill, my Mastodon: econtwitter.net/@goelz

ID: 919732722725281792

Website: https://paulgoelz.de · Joined: 16-10-2017 01:12:23

3 Tweets

115 Followers

50 Following

Ariel Procaccia (@arielprocaccia):

Computational social choice in action: Our open-source sortition system, Panelot, supported this live citizens' panel selection process organized by the nonprofit Of By For All. (Collaborators: Bailey Flanigan, Paul Goelz and Anupam Gupta.) citizenspanel.us

Ariel Procaccia (@arielprocaccia):

Our paper on fair algorithms for selecting citizens' assemblies, which boasts the nuanced title "Fair Algorithms for Selecting Citizens' Assemblies," was just published (open access) in Nature. The work was led by the amazing Bailey Flanigan and Paul Gölz (Mastodon in bio). nature.com/articles/s4158…

Nika Haghtalab (@nhaghtal):

RLHF fine-tunes to a “mythical user” via aggregated feedback—but what if that user represents no one? Excited to share a new paper with Paul Gölz and Kunhe Yang “Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?” #AIAlignment #PluralisticAI #LLMs

Nika Haghtalab (@nhaghtal):

Different users disagree on how usable, helpful, or ethical a response is; that disagreement is captured by their utilities. A minimal goal for alignment: optimize average utility. Define distortion = (optimal avg utility if you knew users' true utilities) ÷ (avg utility of the aligned policy). Lower is better.
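
In symbols, a sketch of that definition (notation assumed from the thread's description, not quoted from the paper):

```latex
% Distortion of the policy \pi_f produced by an alignment method f.
% u_i(y): user i's utility for response y; \pi ranges over feasible policies.
\[
  \operatorname{dist}(\pi_f)
  = \frac{\displaystyle \max_{\pi}\; \tfrac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{y \sim \pi}\,[u_i(y)]}
         {\displaystyle \tfrac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{y \sim \pi_f}\,[u_i(y)]}
  \;\ge\; 1 \qquad \text{(lower is better).}
\]
```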

Nika Haghtalab (@nhaghtal):

Takeaway 1⃣: There is a fundamental limit. Even with infinite data, no method can beat β/2 distortion, where β is the Bradley–Terry temperature. Pairwise feedback just isn't rich enough to optimize even average utility.
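
For reference, this is the standard Bradley–Terry model the β refers to (our rendering of the textbook model, not a formula taken from the paper): a user with utility function u answers comparisons noisily, and larger β means more utility-faithful answers.

```latex
% Bradley–Terry with temperature beta: probability of preferring y over y'
\[
  \Pr[\, y \succ y' \,]
  = \frac{e^{\beta u(y)}}{e^{\beta u(y)} + e^{\beta u(y')}}
  = \sigma\bigl(\beta\,(u(y) - u(y'))\bigr).
\]
```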

Nika Haghtalab (@nhaghtal):

Takeaway 2⃣: RLHF and DPO can go off the rails: distortion can scale ∝ exp(β) and even become unbounded. Their distortion is also highly sensitive to how the comparison data are sampled: tweak the distribution and performance can degrade significantly.
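
To see the mechanism, here is a toy computation we constructed (the population, utilities, and β are all made up; this is an illustration, not an experiment from the paper): a large majority mildly prefers response A while a small minority has a large stake in B, and fitting a single Bradley–Terry reward to the pooled win rate favors A even though B has much higher average utility. Scaling up the minority's stake grows the distortion without changing the pooled feedback at all, which is the intuition for unboundedness.

```python
import numpy as np

def sigmoid(x):
    # clip to avoid overflow in exp for the extreme utility gaps below
    return 1.0 / (1.0 + np.exp(-np.clip(x, -60.0, 60.0)))

beta = 50.0  # Bradley–Terry temperature (made up for this illustration)

# Toy population: 90% of users mildly prefer response A; 10% have a large stake in B.
frac = np.array([0.9, 0.1])
u_A  = np.array([0.1, 0.0])
u_B  = np.array([0.0, 10.0])

# Average (true) utilities — what alignment "should" optimize.
avg_A, avg_B = frac @ u_A, frac @ u_B

# Aggregated pairwise feedback: each user answers "A vs B" via their own
# Bradley–Terry model; the dataset only records the pooled win rate.
p_A_beats_B = frac @ sigmoid(beta * (u_A - u_B))

# Fitting one BT reward to the pooled win rate gives a reward gap
# beta*(r_A - r_B) = logit(p); its sign decides which response a
# reward-maximizing policy favors.
r_gap = np.log(p_A_beats_B / (1.0 - p_A_beats_B)) / beta
chosen_avg = avg_A if r_gap > 0 else avg_B

print(f"pooled P(A beats B) = {p_A_beats_B:.3f} -> fitted reward favors "
      f"{'A' if r_gap > 0 else 'B'}")
print(f"avg utility: A={avg_A:.2f}, B={avg_B:.2f}; "
      f"distortion = {max(avg_A, avg_B) / chosen_avg:.1f}")
```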

Nika Haghtalab (@nhaghtal):

Takeaway 3⃣: Nash Learning from Human Feedback, a.k.a. maximal lotteries in social choice theory, achieves the minimax-optimal distortion bound, and that guarantee holds regardless of how comparisons are sampled or how you set your regularization. 🥳
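
A minimal sketch of computing a maximal lottery from a pairwise win-rate matrix, via the standard zero-sum-game formulation and a generic LP solver (the matrix and helper function are our illustration, not code from the paper):

```python
import numpy as np
from scipy.optimize import linprog

def maximal_lottery(P):
    """Maximal lottery for win probabilities P[a, b] = Pr(a beats b),
    with P + P.T = 1 off the diagonal. It is the optimal (Nash) strategy
    of the symmetric zero-sum game with margin matrix M = P - P.T."""
    M = P - P.T                      # skew-symmetric margins
    n = M.shape[0]
    # Variables x = (p_1..p_n, v): maximize v s.t. (M^T p)_b >= v for all b
    # and p is a probability distribution. By symmetry the game value is 0.
    c = np.append(np.zeros(n), -1.0)             # linprog minimizes -v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])    # v - (M^T p)_b <= 0
    A_eq = np.append(np.ones(n), 0.0).reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n),
                  A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n + [(None, None)])
    return res.x[:n]

# Condorcet cycle A > B > C > A: the maximal lottery mixes uniformly.
P = np.array([[0.5, 0.6, 0.4],
              [0.4, 0.5, 0.6],
              [0.6, 0.4, 0.5]])
print(maximal_lottery(P))   # ~ [1/3, 1/3, 1/3]
```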

Nika Haghtalab (@nhaghtal):

Beyond alignment: under Bradley–Terry noise, our distortion bounds offer a more meaningful lens for social choice. Our constant distortion bound (constant in the number of responses) for Borda and other rules gets around the pathological examples that give these rules infinite distortion without the BT model.
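
For concreteness, one simple way to read "Borda" in this pairwise setting (our simplification, with a made-up win-rate matrix): score each response by its average win rate against the others and pick the top score, which is essentially what win-rate leaderboards do.

```python
import numpy as np

def borda_winner(P):
    """P[a, b] = Pr(a beats b), with P + P.T = 1 off the diagonal.
    Score each response by its average win rate against the others."""
    n = P.shape[0]
    scores = (P.sum(axis=1) - 0.5) / (n - 1)   # drop the self-comparison
    return int(np.argmax(scores)), scores

# Made-up win rates with a clear top response.
P = np.array([[0.50, 0.70, 0.60],
              [0.30, 0.50, 0.55],
              [0.40, 0.45, 0.50]])
print(borda_winner(P))   # response 0 wins with average win rate 0.65
```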

Nika Haghtalab (@nhaghtal):

We don’t model leaderboards directly, but our results still offer insights: Borda-based methods (e.g., Chatbot Arena) can crown models whose average utility is a factor of β worse than an alternative's. This is fertile ground for future research on leaderboards, utility, and distortion!

Nika Haghtalab (@nhaghtal):

Overall, I'm really interested in better understanding how to make alignment work for real users, not the mythical ones 🦄! Get in touch to share your insights. #AIAlignment #MachineLearning #PluralisticAI