Andy Arditi (@andyarditi) 's Twitter Profile
Andy Arditi

@andyarditi

Interpretability, jazz, and sometimes jokes.

ID: 1680699075492970499

linkhttp://andyrdt.com calendar_today16-07-2023 22:01:33

89 Tweet

434 Followers

433 Following

Atsushi Yamamura (山村篤志) (@atsushi_y1230) 's Twitter Profile Photo

Excited to share our latest work, "Fooling LLM graders into giving better grades through neural activity-guided adversarial prompting" (w/ Surya Ganguli)! We investigate distorting AI decision-making to build fair and robust AI judges/graders.arxiv.org/abs/2412.15275 #AISafety 1/n

Owain Evans (@owainevans_uk) 's Twitter Profile Photo

New paper: Reasoning models like DeepSeek  R1 surpass typical LLMs on many tasks. Do they also provide more faithful explanations? Testing on a benchmark, we find reasoning models are much more faithful. It seems this isn't due to specialized training but arises from RL🧵

New paper:
Reasoning models like DeepSeek  R1 surpass typical LLMs on many tasks.
Do they also provide more faithful explanations?
Testing on a benchmark,  we find reasoning models are much more faithful.
It seems this isn't due to specialized training but arises from RL🧵
Goodfire (@goodfireai) 's Twitter Profile Photo

We are excited to announce our collaboration with Arc Institute on their state-of-the-art biological foundation model, Evo 2. Our work reveals how models like Evo 2 process biological information - from DNA to proteins - in ways we can now decode.

We are excited to announce our collaboration with <a href="/arcinstitute/">Arc Institute</a> on their state-of-the-art biological foundation model, Evo 2. Our work reveals how models like Evo 2 process biological information - from DNA to proteins - in ways we can now decode.
Owain Evans (@owainevans_uk) 's Twitter Profile Photo

Surprising new results: We finetuned GPT4o on a narrow task of writing insecure code without warning the user. This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis. This is *emergent misalignment* & we cannot fully explain it 🧵

Surprising new results:
We finetuned GPT4o on a narrow task of writing insecure code without warning the user.
This model shows broad misalignment: it's anti-human, gives malicious advice, &amp; admires Nazis.

This is *emergent misalignment* &amp; we cannot fully explain it 🧵
OpenAI (@openai) 's Twitter Profile Photo

Detecting misbehavior in frontier reasoning models Chain-of-thought (CoT) reasoning models “think” in natural language understandable by humans. Monitoring their “thinking” has allowed us to detect misbehavior such as subverting tests in coding tasks, deceiving users, or giving

Detecting misbehavior in frontier reasoning models

Chain-of-thought (CoT) reasoning models “think” in natural language understandable by humans. Monitoring their “thinking” has allowed us to detect misbehavior such as subverting tests in coding tasks, deceiving users, or giving
Clément Dumas (at ICLR) (@butanium_) 's Twitter Profile Photo

New paper w/Julian Minder & Neel Nanda! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders This finds interpretable and causal chat-only features! 🧵

New paper w/<a href="/jkminder/">Julian Minder</a> &amp; <a href="/NeelNanda5/">Neel Nanda</a>! What do chat LLMs learn in finetuning?

Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders &amp; fix them with BatchTopK crosscoders

This finds interpretable and causal chat-only features! 🧵
Goodfire (@goodfireai) 's Twitter Profile Photo

What goes on inside the mind of a reasoning model? Today we're releasing the first open-source sparse autoencoders (SAEs) trained on DeepSeek's 671B parameter reasoning model, R1—giving us new tools to understand and steer model thinking. Why does this matter?

What goes on inside the mind of a reasoning model? Today we're releasing the first open-source sparse autoencoders (SAEs) trained on DeepSeek's 671B parameter reasoning model, R1—giving us new tools to understand and steer model thinking.

Why does this matter?
David Bau (@davidbau) 's Twitter Profile Photo

ACADEMICS: it is time to get our heads out of our *sses. This is not the moment for personal ambition, why your latest sophisticated widget beats rivals intricate theorem. The scientific franchise is under attack. It is time to defend it to the public. x.com/davidbau/statu…

Michael Hanna (@michaelwhanna) 's Twitter Profile Photo

Mateusz and I are excited to announce circuit-tracer, a library that makes circuit-finding simple! Just type in a sentence, and get out a circuit showing (some of) the features your model uses to predict the next token. Try it on neuronpedia: shorturl.at/SUX2A

<a href="/mntssys/">Mateusz</a> and I are excited to announce circuit-tracer, a library that makes circuit-finding simple!

Just type in a sentence, and get out a circuit showing (some of) the features your model uses to predict the next token. Try it on <a href="/neuronpedia/">neuronpedia</a>: shorturl.at/SUX2A
Tal Linzen (@tallinzen) 's Twitter Profile Photo

International students, and Chinese students in particular, are essential to the AI research ecosystem in the US. You can't say you support AI research in this country and then threaten to revoke Chinese students' visas.

Can Rager (@can_rager) 's Twitter Profile Photo

Can we uncover the list of topics a language model is censored on? Refused topics vary strongly among models. Claude-3.5 vs DeepSeek-R1 refusal patterns:

Can we uncover the list of topics a language model is censored on?

Refused topics vary strongly among models. Claude-3.5 vs DeepSeek-R1 refusal patterns:
Miles Wang (@mileskwang) 's Twitter Profile Photo

We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more We find that emergent misalignment: - happens during reinforcement learning - is controlled by “misaligned persona” features - can be detected and mitigated 🧵:

We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more

We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated

🧵:
Neel Nanda (@neelnanda5) 's Twitter Profile Photo

The call for papers for the NeurIPS Mechanistic Interpretability Workshop is open! Max 4 or 9 pages, due 22 Aug, NeurIPS submissions welcome We welcome any works that further our ability to use the internals of a model to better understand it Details: mechinterpworkshop com

The call for papers for the NeurIPS Mechanistic Interpretability Workshop is open!

Max 4 or 9 pages, due 22 Aug, NeurIPS submissions welcome

We welcome any works that further our ability to use the internals of a model to better understand it

Details: mechinterpworkshop com
Miles Turpin (@milesaturpin) 's Twitter Profile Photo

New @Scale_AI paper! 🌟 LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).

New @Scale_AI paper! 🌟

LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).
Aryo Pradipta Gema (@aryopg) 's Twitter Profile Photo

New Anthropic Research: “Inverse Scaling in Test-Time Compute” We found cases where longer reasoning leads to lower accuracy. Our findings suggest that naïve scaling of test-time compute may inadvertently reinforce problematic reasoning patterns. 🧵

New Anthropic Research: “Inverse Scaling in Test-Time Compute”

We found cases where longer reasoning leads to lower accuracy.
Our findings suggest that naïve scaling of test-time compute may inadvertently reinforce problematic reasoning patterns.

🧵