Alessandro Stolfo (@alesstolfo) Twitter Tweets • TwiCopy

Marktechpost AI Research News ⚡

10 months ago

Microsoft AI Introduces Activation Steering: A Novel AI Approach to Improving Instruction-Following in Large Language Models. Researchers from ETH Zürich and Microsoft Research introduced a novel method to tackle these limitations: activation steering. This approach moves away

Mazda Moayeri

@mlmazda

10 months ago

So you’re seeing 1-2% gains over some benchmarks… but what does this really mean? Where do these gains come from? With “skill-slices”, you can unearth the specific *skills* that a model improves most over another. 🧵on my MSR internship work 🤠

Natasha Butt

@natashaeve4

10 months ago

Introducing BenchAgents: a framework for automated benchmark creation, using multiple LLM agents that interact with each other and with developers to generate diverse, high-quality, and challenging benchmarks w/ Varun Chandrasekaran, Neel Joshi, Besmira Nushi 💙💛, Vidhisha Balachandran, Microsoft Research 🧵1/8

Neel Nanda

@neelnanda5

9 months ago

NeurIPS has an overwhelming amount of papers, so I made myself a hacky spreadsheet of all (well, most) of the interpretability papers - sharing in case others find it useful! It's definitely got false negatives and positives, but hopefully is better than baseline.

Alessandro Stolfo

@alesstolfo

9 months ago

Excited to be at #NeurIPS2024 presenting our mech interp work with Ben Wu @ICLR & Neel Nanda: Confidence Regulation Neurons in Language Models! Come check out our poster on Thursday at 11am, East Exhibit Hall A-C (#3105). Hope to see you there!

Alessandro Stolfo

@alesstolfo

9 months ago

Was great to hang out with the multimodal man of the moment Lucas Beyer (bl16)

Alice Bizeul

@alicebizeul

7 months ago

✨New Preprint ✨ Ever thought that reconstructing masked pixels for image representation learning seems sub-optimal? In our new preprint, we show how masking principal components, rather than raw pixel patches, improves Masked Image Modelling (MIM). Find out more below 🧵

Alessandro Stolfo

@alesstolfo

5 months ago

Our paper “Improving Instruction-Following in Language Models through Activation Steering” has been accepted to #ICLR2025! We're also excited to share that our public GitHub repo is now live. Code: github.com/microsoft/llm-… Camera-ready: arxiv.org/abs/2410.12877

Alessandro Stolfo

@alesstolfo

4 months ago

Excited to be at #ICLR2025 to present some recent work!
“Improving Instruction-Following in Language Models through Activation Steering” - Friday 3pm, Hall 3 + Hall 2B (#293)
“Antipodal Pairing and Mechanistic Signals in Dense SAE Latents” - Sunday 11am @ SparseLLM workshop

Alessandro Stolfo

@alesstolfo

4 months ago

Excited to have contributed to this effort to benchmark mech interp methods! Check out our paper. And if you’re working on circuit discovery, consider submitting your method 🔍

Vidhisha Balachandran

@vidhisha_b

4 months ago

In Singapore! We have an exciting set of talks and papers at #ICLR2025. Besmira Nushi 💙💛, Vibhav, and I will be around to chat about our recent inference scaling evaluation report. You can also talk to Mazda Moayeri and Alessandro Stolfo about their internship projects on model understanding.

Vilém Zouhar

@zouharvi

4 months ago

incredible monetization opportunity (this is a joke)

Zhijing Jin✈️ ICLR Singapore

@zhijingjin

4 months ago

Very honored to be one of the 15,553 runners today in #SOLA Relay Zürich. And also super proud of our #NLProc team of 14 finishing 113km in total! Many many thanks to all the friends & our Prof Mrinmaya Sachan! It's such a meaningful day in my life. Yet to run for #EMNLP now ;)!

Alessandro Stolfo

@alesstolfo

2 months ago

Many SAEs learn latents that activate on almost all tokens. Are these undesired phenomena or meaningful features? In our new work, we show that many of these "dense" latents are real, interpretable signals in LLMs. Paper: arxiv.org/abs/2506.15679 👇 summary thread by lily (xiaoqing)

Alessandro Stolfo

@alesstolfo

2 months ago

New cool work led by Alan (Alan Chen): we show how to transfer SAE latents and steering vectors between LLMs of different sizes using simple affine mappings. Check out the paper: arxiv.org/abs/2506.06609 And the thread below 👇

Alessandro Stolfo

@alesstolfo

2 months ago

New paper on detecting & correcting arithmetic errors in LLMs! We show that simple probes can recover correct answers from hidden states and trigger self-correction of reasoning errors. 📍 If you’re at #ICML2025 stop by our poster @ the Act Interp WS 📝arxiv.org/abs/2507.12379

Alessandro Stolfo

@alesstolfo

a month ago

Had a great time speaking at NEC Laboratories Europe about using activation steering for better instruction-following in LLMs! Check out the talk 🗣️: youtu.be/3ozuaGaEjpo?si… and paper 📜: arxiv.org/abs/2410.12877 This work that I did at Microsoft Research shows how interpretability-based
