Yoav Gur Arieh (@guryoav) Twitter Tweets • TwiCopy

Yoav Gur Arieh

@guryoav


CS MS Student | Researching LLM interpretability

ID: 1160931956458184706

Link: http://yoav.ml

Joined: 12-08-2019 15:12:02

21 Tweets

51 Followers

150 Following

Ekdeep Singh Lubana

@ekdeepl

a year ago

Paper alert: *Awarded best paper* at the NeurIPS workshop on Foundation Model Interventions! 🧵👇 We analyze the (in)abilities of SAEs by relating them to the field of disentangled representation learning, where the limitations of AE-based interpretability protocols have been well established! 🤯

493 likes · 6 replies · 84 retweets

Mor Geva

@megamor2

10 months ago

How can we interpret LLM features at scale? 🤔 Current pipelines use activating inputs, which is costly and ignores how features causally affect model outputs! We propose efficient output-centric methods that better predict how steering a feature will affect model outputs. New

112 likes · 6 replies · 27 retweets

Ohav

@ohavba

9 months ago

"One bad apple can spoil the bunch 🍎", and that's doubly true for language agents! Our new paper shows how monitoring and intervention can prevent agents from going rogue, boosting performance by up to 20%. We're also releasing a new multi-agent environment 🕵️‍♂️

26 likes · 2 replies · 6 retweets

Sohee Yang

@soheeyang_

5 months ago

🚨 New Paper 🧵 How effectively do reasoning models reevaluate their thought? We find that:
- Models excel at identifying unhelpful thoughts but struggle to recover from them
- Smaller models can be more robust
- Self-reevaluation ability is far from true meta-cognitive awareness

103 likes · 3 replies · 24 retweets

Ido Cohen

@idoc0hen

4 months ago

A Vision-Language Model can answer questions about Robin Williams. It can also recognize him in a photo. So why does it FAIL when asked the same questions using his photo instead of his name? A thread on our new #acl2025 paper that explores this puzzle 🧵

24 likes · 1 reply · 7 retweets