Andy Arditi (@andyarditi) Twitter Tweets • TwiCopy

Atsushi Yamamura (山村篤志)

a year ago

Excited to share our latest work, "Fooling LLM graders into giving better grades through neural activity-guided adversarial prompting" (w/ Surya Ganguli)! We investigate distorting AI decision-making to build fair and robust AI judges/graders.arxiv.org/abs/2412.15275 #AISafety 1/n

thumb_up_off_alt28

chat_bubble_outline2

repeat4

shareShare

Andy Arditi

@andyarditi

10 months ago

Check out tom white's beautiful visualizations of LLM features!

thumb_up_off_alt3

chat_bubble_outline0

repeat0

shareShare

Andy Arditi

@andyarditi

9 months ago

Highly recommend working with Neel!

thumb_up_off_alt11

chat_bubble_outline0

repeat0

shareShare

Owain Evans

@owainevans_uk

9 months ago

New paper: Reasoning models like DeepSeek  R1 surpass typical LLMs on many tasks. Do they also provide more faithful explanations? Testing on a benchmark, we find reasoning models are much more faithful. It seems this isn't due to specialized training but arises from RL🧵

thumb_up_off_alt452

chat_bubble_outline6

repeat78

shareShare

Goodfire

@goodfireai

9 months ago

We are excited to announce our collaboration with Arc Institute on their state-of-the-art biological foundation model, Evo 2. Our work reveals how models like Evo 2 process biological information - from DNA to proteins - in ways we can now decode.

We are excited to announce our collaboration with <a href="/arcinstitute/">Arc Institute</a> on their state-of-the-art biological foundation model, Evo 2. Our work reveals how models like Evo 2 process biological information - from DNA to proteins - in ways we can now decode.

thumb_up_off_alt400

chat_bubble_outline7

repeat43

shareShare

Owain Evans

@owainevans_uk

9 months ago

Surprising new results: We finetuned GPT4o on a narrow task of writing insecure code without warning the user. This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis. This is *emergent misalignment* & we cannot fully explain it 🧵

thumb_up_off_alt6,6K

chat_bubble_outline432

repeat984

shareShare

OpenAI

@openai

9 months ago

Detecting misbehavior in frontier reasoning models Chain-of-thought (CoT) reasoning models “think” in natural language understandable by humans. Monitoring their “thinking” has allowed us to detect misbehavior such as subverting tests in coding tasks, deceiving users, or giving

thumb_up_off_alt5,5K

chat_bubble_outline418

repeat751

shareShare

Clément Dumas (at ICLR)

@butanium_

8 months ago

New paper w/Julian Minder & Neel Nanda! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders This finds interpretable and causal chat-only features! 🧵

New paper w/<a href="/jkminder/">Julian Minder</a> & <a href="/NeelNanda5/">Neel Nanda</a>! What do chat LLMs learn in finetuning?

Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders

This finds interpretable and causal chat-only features! 🧵

thumb_up_off_alt187

chat_bubble_outline5

repeat28

shareShare

Goodfire

@goodfireai

7 months ago

What goes on inside the mind of a reasoning model? Today we're releasing the first open-source sparse autoencoders (SAEs) trained on DeepSeek's 671B parameter reasoning model, R1—giving us new tools to understand and steer model thinking. Why does this matter?

thumb_up_off_alt631

chat_bubble_outline20

repeat69

shareShare

David Bau

@davidbau

6 months ago

ACADEMICS: it is time to get our heads out of our *sses. This is not the moment for personal ambition, why your latest sophisticated widget beats rivals intricate theorem. The scientific franchise is under attack. It is time to defend it to the public. x.com/davidbau/statu…

thumb_up_off_alt142

chat_bubble_outline2

repeat21

shareShare

Michael Hanna

@michaelwhanna

6 months ago

Mateusz and I are excited to announce circuit-tracer, a library that makes circuit-finding simple! Just type in a sentence, and get out a circuit showing (some of) the features your model uses to predict the next token. Try it on neuronpedia: shorturl.at/SUX2A

<a href="/mntssys/">Mateusz</a> and I are excited to announce circuit-tracer, a library that makes circuit-finding simple!

Just type in a sentence, and get out a circuit showing (some of) the features your model uses to predict the next token. Try it on <a href="/neuronpedia/">neuronpedia</a>: shorturl.at/SUX2A

thumb_up_off_alt199

chat_bubble_outline8

repeat45

shareShare

Tal Linzen

@tallinzen

6 months ago

International students, and Chinese students in particular, are essential to the AI research ecosystem in the US. You can't say you support AI research in this country and then threaten to revoke Chinese students' visas.

thumb_up_off_alt180

chat_bubble_outline10

repeat17

shareShare

Can Rager

@can_rager

5 months ago

Can we uncover the list of topics a language model is censored on? Refused topics vary strongly among models. Claude-3.5 vs DeepSeek-R1 refusal patterns:

thumb_up_off_alt84

chat_bubble_outline2

repeat22

shareShare

Miles Wang

@mileskwang

5 months ago

We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more We find that emergent misalignment: - happens during reinforcement learning - is controlled by “misaligned persona” features - can be detected and mitigated 🧵:

thumb_up_off_alt1,1K

chat_bubble_outline76

repeat144

shareShare

Neel Nanda

@neelnanda5

5 months ago

The call for papers for the NeurIPS Mechanistic Interpretability Workshop is open! Max 4 or 9 pages, due 22 Aug, NeurIPS submissions welcome We welcome any works that further our ability to use the internals of a model to better understand it Details: mechinterpworkshop com

thumb_up_off_alt221

chat_bubble_outline2

repeat29

shareShare

Miles Turpin

@milesaturpin

4 months ago

New @Scale_AI paper! 🌟 LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).

thumb_up_off_alt217

chat_bubble_outline7

repeat36

shareShare

Aryo Pradipta Gema

@aryopg

4 months ago

New Anthropic Research: “Inverse Scaling in Test-Time Compute” We found cases where longer reasoning leads to lower accuracy. Our findings suggest that naïve scaling of test-time compute may inadvertently reinforce problematic reasoning patterns. 🧵

thumb_up_off_alt989

chat_bubble_outline52

repeat133

shareShare

Jiachen Zhao

@jcz12856876

4 months ago

1/ 🚨New Paper 🚨 LLMs are trained to refuse harmful instructions, but internally, do they see harmfulness and refusal as the same? ⚔️We find causal evidence that 👈”LLMs encode harmfulness and refusal separately” 👉. ✂️LLMs may know a prompt is harmful internally yet still

thumb_up_off_alt49

chat_bubble_outline4

repeat13

shareShare

Andy Arditi

@andyarditi

4 months ago

🦉

thumb_up_off_alt8

chat_bubble_outline0

repeat0

shareShare

Helena Casademunt

@hcasademunt

4 months ago

Problem: Train LLM on insecure code → it becomes broadly misaligned Solution: Add safety data? What if you can't? Use interpretability! We remove misaligned concepts during finetuning to steer OOD generalization We reduce emergent misalignment 10x w/o modifying training data

thumb_up_off_alt115

chat_bubble_outline7

repeat24

shareShare