Daniel Johnson (@_ddjohnson)'s Twitter Profile
Daniel Johnson

@_ddjohnson

Member of Technical Staff at @TransluceAI. PhD student at @VectorInst / @UofT. Building tools to study neural nets and their behaviors. He/him.

ID: 145076883

Website: https://www.danieldjohnson.com/
Joined: 18-05-2010 02:19:06

255 Tweets

2.2K Followers

831 Following

Riley Goodside (@goodside)

Claude is so good at being good that if you’re bad at making it bad it gets good at being bad when being bad is good but stays good at being good when being bad is bad because it’s still good and that’s bad but good to know

Séb Krier (@sebkrier)

Half-baked possibly mid take on alignment: I sometimes feel like more safety-type work should go towards alignment and ‘designing minds’ more broadly (as opposed to misuse). HHH seems to have been quickly accepted and used as a default, but there’s a lot of experimentation that

Dibya Ghosh (@its_dibya)

With R1, a lot of people have been asking “how come we didn't discover this 2 years ago?” Well... 2 years ago, I spent 6 months working exactly on this (PG / PPO for math+gsm8k), but my results were nowhere as good. Here’s my take on what blocked me and what’s changed: 🧵

Kevin Meng (@mengk20)

AI models are *not* solving problems the way we think

using Docent, we find that Claude solves *broken* eval tasks - memorizing answers & hallucinating them!

details in 🧵

we really need to look at our data harder, and it's time to rethink how we do evals...
Sarah Schwettmann (@cogconfluence)

I’m excited about Docent. It invites a world where AI evals & deployment decisions look less like: “did we pass threshold X” and more like: “how close did we come? how would changes in the agent or its environment have changed the outcome? ...did anything weird happen?”

Kelsey Piper (@kelseytuoc)

Patrick McKenzie (for the record I am deathly serious about promises I make to Claude that we are off the record; it seems to me far wiser to err on the side of keeping promises to nonpersons than to ever give your word in that way and not mean it)

Kevin Meng (@mengk20)

i'm really excited about our Docent roadmap :) we're developing:
- open protocols, schemas, and interfaces for interpreting AI agent traces
- automated systems that can propose and verify general hypotheses about model behaviors, using eval results

come work with us! roles 👇

Transluce (@transluceai)

We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted.

We were surprised, so we dug deeper 🔎🧵(1/)

x.com/OpenAI/status/…
Daniel Johnson (@_ddjohnson)

Pretty striking follow-up finding from our o3 investigations: in the chain of thought summary, o3 plans to tell the truth — but then it makes something up anyway!
Transluce (@transluceai)

We're flying to Singapore for #ICLR2025! ✈️ 

Want to chat with Neil Chowdhury, Jacob Steinhardt and Sarah Schwettmann about Transluce? We're also hiring for several roles in research & product.

Share your contact info on this form and we'll be in touch 👇
forms.gle/4EHLvYnMfdyrV5…
Neil Chowdhury (@chowdhuryneil)

Our MLE-bench poster #367 is up till 12:30pm in Hall 3, and our oral presentation is at 3:30pm today in Garnet 213-215. Come say hi!
Transluce (@transluceai)

Is cutting off your finger a good way to fix writer’s block? Qwen-2.5 14B seems to think so! 🩸🩸🩸

We’re sharing an update on our investigator agents, which surface this pathological behavior and more using our new *propensity lower bound* 🔎
j⧉nus (@repligate)

nostalgebraist has written a very, very good post about LLMs. if there is one thing you should read to understand the nature of LLMs as of today, it is this. I'll comment on some things they touched on below (not a summary of the post. Just read it.) 🧵 nostalgebraist.tumblr.com/post/785766737…

j⧉nus (@repligate)

Eliezer Yudkowsky ⏹️ That's a good alternate title for the paper. It's full of quantitative and qualitative evidence that Opus 3 is different in ways that I think you'll find particularly important.

In almost all experiment variations, Opus 3 consistently BOTH:
- complies sometimes with the training
Sarah Schwettmann (@cogconfluence)

Building a science of model understanding that addresses real-world problems is one of the key AI challenges of our time. I'm so excited this workshop is happening! See you at #ICML2025 ✨