Alessandro Stolfo (@alesstolfo)'s Twitter Profile
Alessandro Stolfo

@alesstolfo

PhD Student @ETH - LLM Interpretability - Prev. @MSFTResearch @oracle

ID: 537527790

Website: https://alestolfo.github.io · Joined: 26-03-2012 18:23:25

109 Tweets

1.1K Followers

568 Following

Marktechpost AI Research News ⚡ (@marktechpost)

Microsoft AI Introduces Activation Steering: A Novel AI Approach to Improving Instruction-Following in Large Language Models

Researchers from ETH Zürich and Microsoft Research introduced a novel method to tackle these limitations: activation steering. This approach moves away …
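For context on how activation steering works mechanically, here is a minimal, hypothetical sketch; the model, layer index, and prompts below are illustrative assumptions, not the paper's exact setup. The idea: derive a steering vector from the difference between residual-stream activations with and without an instruction, then add it back in during generation.

```python
# Hypothetical sketch of activation steering; model, layer, and prompts
# are illustrative assumptions, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
layer = model.transformer.h[6]                        # assumed steering site

def last_token_resid(prompt):
    captured = {}
    def hook(mod, inp, out):
        captured["h"] = out[0][:, -1, :].detach()     # residual stream, last token
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return captured["h"]

# Difference-style steering vector: instruction present vs. absent.
steer = (last_token_resid("Answer in French: Name a city.")
         - last_token_resid("Name a city."))

def steering_hook(mod, inp, out):
    return (out[0] + steer,) + out[1:]                # add the vector at every step

handle = layer.register_forward_hook(steering_hook)
ids = tok("Name a city.", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=15)[0]))
handle.remove()
```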
Mazda Moayeri (@mlmazda)

So you’re seeing 1-2% gains over some benchmarks… but what does this really mean? Where do these gains come from?

With “skill-slices”, you can unearth the specific *skills* that a model improves most over another.

🧵on my MSR internship work 🤠
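The core slicing computation is simple to picture. Below is a toy sketch under stated assumptions: per-instance skill annotations are taken as given (in the actual work they are derived automatically), and the helper name is hypothetical. It groups benchmark instances by skill and compares per-skill accuracy between two models.

```python
# Toy sketch of skill-slice gain attribution; skill labels are assumed given.
from collections import defaultdict

def skill_slice_deltas(instances, preds_a, preds_b):
    """instances: list of dicts with 'skills' (list[str]) and 'answer'."""
    stats = defaultdict(lambda: [0, 0, 0])  # skill -> [n, hits_a, hits_b]
    for inst, pa, pb in zip(instances, preds_a, preds_b):
        for skill in inst["skills"]:
            s = stats[skill]
            s[0] += 1
            s[1] += pa == inst["answer"]
            s[2] += pb == inst["answer"]
    # Accuracy gain of model B over model A on each skill slice.
    return {sk: (hb - ha) / n for sk, (n, ha, hb) in stats.items()}

deltas = skill_slice_deltas(
    [{"skills": ["arithmetic"], "answer": "4"},
     {"skills": ["arithmetic", "units"], "answer": "3 m"}],
    preds_a=["4", "2 m"], preds_b=["4", "3 m"],
)
print(sorted(deltas.items(), key=lambda kv: -kv[1]))  # biggest per-skill gains first
```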
Natasha Butt (@natashaeve4)

Introducing BenchAgents: a framework for automated benchmark creation, using multiple LLM agents that interact with each other and with developers to generate diverse, high-quality, and challenging benchmarks w/ Varun Chandrasekaran, Neel Joshi, Besmira Nushi 💙💛, Vidhisha Balachandran, Microsoft Research 🧵1/8
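As a rough mental model only: a plan-generate-verify-refine loop over LLM calls might look like the sketch below. The prompts and the `llm` helper are placeholders, not the framework's actual API.

```python
# Speculative sketch of a BenchAgents-style pipeline; prompts and the
# `llm` helper are placeholders, not the framework's real interface.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your chat-completion call here")

def generate_benchmark(task: str, n: int) -> list[str]:
    # Planning agent: propose diverse problem specifications.
    plan = llm(f"Plan {n} diverse, challenging problem specs for: {task}")
    problems = []
    for spec in plan.split("\n")[:n]:
        # Generation agent: write a problem from one spec.
        candidate = llm(f"Write one benchmark problem following this spec: {spec}")
        # Verification agent: check correctness and difficulty.
        verdict = llm("Check this problem for correctness and difficulty:\n"
                      f"{candidate}\nReply PASS or a critique.")
        if not verdict.startswith("PASS"):
            # Refinement agent: revise using the critique.
            candidate = llm(f"Revise using this critique:\n{verdict}\n{candidate}")
        problems.append(candidate)
    return problems
```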
Neel Nanda (@neelnanda5)

NeurIPS has an overwhelming amount of papers, so I made myself a hacky spreadsheet of all (well, most) of the interpretability papers - sharing in case others find it useful!

It's definitely got false negatives and positives, but hopefully is better than baseline.
Alessandro Stolfo (@alesstolfo)

Excited to be at #NeurIPS2024 presenting our mech interp work with Ben Wu & Neel Nanda: Confidence Regulation Neurons in Language Models!

Come check out our poster on Thursday at 11am, East Exhibit Hall A-C (#3105). Hope to see you there!
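One simple way to probe for this kind of neuron is sketched below, with placeholder layer/neuron indices (not the neurons identified in the paper): zero-ablate a single MLP unit and watch the next-token entropy move.

```python
# Illustrative probe for confidence-regulating neurons: zero-ablate one
# MLP hidden unit and measure the shift in next-token entropy.
# LAYER and NEURON are arbitrary placeholders, not the paper's findings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER, NEURON = 11, 584  # placeholders

def next_token_entropy(prompt, ablate=False):
    handle = None
    if ablate:
        def hook(mod, inp, out):
            out[:, :, NEURON] = 0.0  # kill one MLP hidden unit
            return out
        handle = model.transformer.h[LAYER].mlp.act.register_forward_hook(hook)
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
    if handle:
        handle.remove()
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum().item()

base = next_token_entropy("The Eiffel Tower is in")
ablated = next_token_entropy("The Eiffel Tower is in", ablate=True)
print(f"entropy: {base:.3f} -> {ablated:.3f}")  # confidence neurons shift this
```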
Alice Bizeul (@alicebizeul)

✨New Preprint ✨ Ever thought that reconstructing masked pixels for image representation learning seems sub-optimal?

In our new preprint, we show how masking principal components, rather than raw pixel patches, improves Masked Image Modelling (MIM).

Find out more below 🧵
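A toy version of the idea, with synthetic data and scikit-learn PCA (the preprint's actual objective and training setup are more involved): fit PCA over flattened images, hide the leading components, and treat the hidden coefficients as the prediction target.

```python
# Toy sketch of masking principal components instead of pixel patches.
# Synthetic data; the preprint's actual objective is more involved.
import numpy as np
from sklearn.decomposition import PCA

images = np.random.rand(256, 32 * 32)  # stand-in dataset, flattened 32x32
pca = PCA(n_components=64).fit(images)
codes = pca.transform(images)          # (256, 64) PC coefficients

mask = np.zeros(64, dtype=bool)
mask[:16] = True                       # mask the 16 leading components

visible = codes * ~mask                # the model sees unmasked components
target = codes * mask                  # and must predict the masked ones
recon_visible = pca.inverse_transform(visible)  # image with top PCs removed
```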
Alessandro Stolfo (@alesstolfo)

Our paper “Improving Instruction-Following in Language Models through Activation Steering” has been accepted to #ICLR2025! We're also excited to share that our public GitHub repo is now live.

Code: github.com/microsoft/llm-…
Camera-ready: arxiv.org/abs/2410.12877

Alessandro Stolfo (@alesstolfo)

Excited to be at #ICLR2025 to present some recent work!

“Improving Instruction-Following in Language Models through Activation Steering” - Friday 3pm, Hall 3 + Hall 2B (#293)

“Antipodal Pairing and Mechanistic Signals in Dense SAE Latents” - Sunday 11am @ SparseLLM workshop
Alessandro Stolfo (@alesstolfo)

Excited to have contributed to this effort to benchmark mech interp methods! Check out our paper. And if you’re working on circuit discovery, consider submitting your method 🔍

Vidhisha Balachandran (@vidhisha_b)

In Singapore! We have an exciting set of talks and papers at #ICLR2025.

<a href="/besanushi/">Besmira Nushi 💙💛</a> , Vibhav and I will be around to chat about our recent inference scaling evaluation report. You can also talk to <a href="/MLMazda/">Mazda Moayeri</a>  and <a href="/alesstolfo/">Alessandro Stolfo</a> about their internship projects on model understanding.
Zhijing Jin ✈️ ICLR Singapore (@zhijingjin)

Very honored to be one of the 15,553 runners today in the #SOLA Relay Zürich. And also super proud of our #NLProc team of 14 finishing 113 km in total! Many, many thanks to all the friends & our Prof. Mrinmaya Sachan! It's such a meaningful day in my life. Yet to run for #EMNLP now ;)!
Alessandro Stolfo (@alesstolfo)

Many SAEs learn latents that activate on almost all tokens. Are these undesired phenomena or meaningful features?
In our new work, we show that many of these "dense" latents are real, interpretable signals in LLMs.

Paper: arxiv.org/abs/2506.15679

👇 summary thread by lily (xiaoqing)
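Spotting dense latents reduces to measuring firing frequency across a token stream. A generic sketch follows; the SAE encoder here is a random stand-in, not the paper's released models.

```python
# Generic sketch for finding "dense" SAE latents by firing frequency.
# The encoder below is a toy stand-in, not a trained SAE.
import torch

def latent_density(sae_encode, acts, threshold=0.0):
    """acts: (n_tokens, d_model) residual-stream activations.
    sae_encode: maps acts -> (n_tokens, n_latents) latent activations."""
    z = sae_encode(acts)                       # SAE latent activations
    return (z > threshold).float().mean(0)     # firing frequency per latent

# Toy stand-in SAE encoder: ReLU(acts @ W_enc + b_enc)
d_model, n_latents = 768, 8 * 768
W_enc = torch.randn(d_model, n_latents) / d_model ** 0.5
b_enc = torch.zeros(n_latents)
encode = lambda a: torch.relu(a @ W_enc + b_enc)

density = latent_density(encode, torch.randn(10_000, d_model))
dense = (density > 0.5).nonzero().squeeze(-1)  # active on >50% of tokens
print(f"{len(dense)} dense latents out of {n_latents}")
```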
Alessandro Stolfo (@alesstolfo)

New cool work led by Alan Chen: we show how to transfer SAE latents and steering vectors between LLMs of different sizes using simple affine mappings.

Check out the paper: arxiv.org/abs/2506.06609
And the thread below 👇
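The transfer idea can be sketched as a least-squares affine fit on paired activations; dimensions and data below are placeholders, and the closed-form fit is just one way to estimate the map.

```python
# Hedged sketch of affine transfer between representation spaces:
# fit an affine map from model A's hidden space to model B's on paired
# activations, then push a steering vector through it. Dims/data are toy.
import torch

def fit_affine(H_a, H_b):
    """H_a: (n, d_a), H_b: (n, d_b) activations on the same inputs."""
    X = torch.cat([H_a, torch.ones(len(H_a), 1)], dim=1)  # append bias column
    W = torch.linalg.lstsq(X, H_b).solution               # (d_a + 1, d_b)
    return W[:-1], W[-1]                                  # weight, bias

n, d_a, d_b = 5000, 768, 1024
H_a, H_b = torch.randn(n, d_a), torch.randn(n, d_b)       # stand-in activations
W, b = fit_affine(H_a, H_b)

steer_a = torch.randn(d_a)   # steering vector found in model A
steer_b = steer_a @ W        # transferred direction for model B
# Note: the bias b applies to full hidden states; a steering *direction*
# (a difference of states) transfers through the linear part alone.
```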

Alessandro Stolfo (@alesstolfo)

New paper on detecting & correcting arithmetic errors in LLMs! We show that simple probes can recover correct answers from hidden states and trigger self-correction of reasoning errors.

📍 If you’re at #ICML2025, stop by our poster @ the Act Interp WS
📝 arxiv.org/abs/2507.12379
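In spirit, the detection side is a linear probe on hidden states. A synthetic-data sketch follows; the features are random stand-ins with an injected signal, not real activations, and the setup is illustrative rather than the paper's protocol.

```python
# Minimal error-detection probe sketch: logistic regression on hidden
# states flags arithmetic errors. Features here are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 768
H = rng.normal(size=(2000, d))             # "hidden states" at the answer token
is_error = rng.integers(0, 2, size=2000)   # 1 = the model's arithmetic was wrong
H[is_error == 1] += 0.3                    # fake signal so the probe has something

probe = LogisticRegression(max_iter=1000).fit(H[:1500], is_error[:1500])
print("held-out accuracy:", probe.score(H[1500:], is_error[1500:]))

# At inference time: if the probe fires, trigger a self-correction step,
# e.g. re-prompt the model to recompute the flagged arithmetic.
```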

Alessandro Stolfo (@alesstolfo)

Had a great time speaking at NEC Laboratories Europe about using activation steering for better instruction-following in LLMs!

Check out the talk 🗣️: youtu.be/3ozuaGaEjpo?si…
and the paper 📜: arxiv.org/abs/2410.12877

This work that I did at Microsoft Research shows how interpretability-based …