Yoav Gur Arieh (@guryoav)'s Twitter Profile
Yoav Gur Arieh

@guryoav

CS MS Student | Researching LLM interpretability

ID: 1160931956458184706

Link: http://yoav.ml

Joined: 12-08-2019 15:12:02

21 Tweets

51 Followers

150 Following

Ekdeep Singh Lubana (@ekdeepl)

Paper alert––*Awarded best paper* at NeurIPS workshop on Foundation Model Interventions! 🧵👇 We analyze the (in)abilities of SAEs by relating them to the field of disentangled representation learning, where limitations of AE-based interpretability protocols have been well established!🤯

Mor Geva (@megamor2)

How can we interpret LLM features at scale? 🤔 Current pipelines use activating inputs, which is costly and ignores how features causally affect model outputs! We propose efficient output-centric methods that better predict how steering a feature will affect model outputs. New

Sohee Yang (@soheeyang_)

🚨 New Paper 🧵
How effectively do reasoning models reevaluate their thought? We find that:
- Models excel at identifying unhelpful thoughts but struggle to recover from them
- Smaller models can be more robust
- Self-reevaluation ability is far from true meta-cognitive awareness
