Xianjun Yang (@xianjun_agi)'s Twitter Profile
Xianjun Yang

@xianjun_agi

RS @AIatMeta. GenAI safety, data-centric AI. Previously PhD @ucsbnlp, BEng @tsinghua_uni. Opinions are my own.

All Watched Over by Machines of Loving Grace.

ID: 1224810594882113536

Link: https://xianjun-yang.github.io/
Joined: 04-02-2020 21:43:09

335 Tweets

886 Followers

1.1K Following

Prateek Yadav (@prateeky2806):

Ever wondered if model merging works at scale? Maybe the benefits wear off for bigger models?

Maybe you considered using model merging for post-training of your large model but not sure if it  generalizes well?

cc: Google AI Google DeepMind UNC NLP
🧵👇

Excited to announce my
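
For readers unfamiliar with the mechanics, here is a minimal sketch of the simplest form of model merging, plain parameter averaging of same-architecture checkpoints. It illustrates the general idea only, not the specific recipe or scale studied in the thread; the file names and uniform weights are placeholders.

```python
# Minimal sketch of weight-space model merging via simple parameter averaging.
# Generic illustration only, not the paper's merging recipe.
import torch

def average_merge(state_dicts, weights=None):
    """Merge same-architecture checkpoints by averaging their parameters."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
    return merged

# Usage (hypothetical same-architecture checkpoints):
# merged_sd = average_merge([torch.load("model_a.pt", map_location="cpu"),
#                            torch.load("model_b.pt", map_location="cpu")])
# model.load_state_dict(merged_sd)
```
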
Anthropic (@anthropicai):

New Anthropic research: Forecasting rare language model behaviors.

We forecast whether risks will occur after a model is deployed—using even very limited sets of test data.
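
As a back-of-the-envelope illustration of why deployment-scale forecasting matters (and not Anthropic's actual method), the toy calculation below extrapolates a per-query failure rate estimated on a small test set to the chance of seeing at least one failure across many deployment queries, under a naive independence assumption; the zero-failure case uses the "rule of three" upper bound.

```python
# Toy back-of-the-envelope (not the paper's method): extrapolate a per-query
# failure rate measured on a small test set to deployment scale, assuming
# queries are independent. With zero observed failures, use the "rule of three"
# approximate 95% upper bound on the per-query rate.

def deployment_risk(failures: int, n_tests: int, n_deploy_queries: int) -> float:
    """P(at least one failure across deployment queries)."""
    p = (3.0 / n_tests) if failures == 0 else (failures / n_tests)
    return 1.0 - (1.0 - p) ** n_deploy_queries

# Zero failures in 10,000 test queries still implies near-certain failure
# somewhere across 10 million deployment queries.
print(deployment_risk(0, 10_000, 10_000_000))  # ≈ 1.0
```
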
Sagnik Mukherjee (@saagnikkk):

🚨 Paper Alert: “RL Finetunes Small Subnetworks in Large Language Models”

From DeepSeek V3 Base to DeepSeek R1 Zero, a whopping 86% of parameters were NOT updated during RL training 😮😮
And this isn’t a one-off. The pattern holds across RL algorithms and models.
🧵A Deep Dive
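
A rough sketch of how one could reproduce that kind of statistic on their own checkpoints: compare a base and an RL-finetuned state dict entry by entry and report the fraction of parameter entries left (numerically) unchanged. File names and the tolerance below are placeholders.

```python
# Rough sketch: what fraction of parameters are identical between a base
# checkpoint and its RL-finetuned counterpart? (Loosely mirrors the 86% figure;
# file names and the tolerance are placeholders.)
import torch

def fraction_unchanged(base_sd, tuned_sd, atol: float = 0.0) -> float:
    same, total = 0, 0
    for name, base_param in base_sd.items():
        diff = (base_param - tuned_sd[name]).abs()
        same += diff.le(atol).sum().item()
        total += base_param.numel()
    return same / total

# base = torch.load("base.pt", map_location="cpu")
# tuned = torch.load("rl_tuned.pt", map_location="cpu")
# print(f"{fraction_unchanged(base, tuned):.1%} of parameter entries unchanged")
```
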
Sonia (@soniajoseph_):

Our paper Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video received an Oral at the Mechanistic Interpretability for Vision Workshop at CVPR 2025! 🎉

We’ll be in Nashville next week. Come say hi 👋

#CVPR2025 Mechanistic Interpretability for Vision @ CVPR2025
Ekdeep Singh Lubana (@ekdeepl):

🚨 New paper alert! Linear representation hypothesis (LRH) argues concepts are encoded as **sparse sum of orthogonal directions**, motivating interpretability tools like SAEs. But what if some concepts don’t fit that mold? Would SAEs capture them? 🤔 1/11
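
To make the assumption concrete: an SAE posits that an activation vector is approximately a sparse, non-negative combination of learned dictionary directions. The toy module below is a generic sketch of that setup (sizes and the L1 coefficient are arbitrary), not the construction analyzed in the paper.

```python
# Generic sketch of the assumption behind SAEs: an activation is approximately a
# sparse, non-negative combination of learned dictionary directions. Sizes and
# the L1 coefficient are arbitrary; this is not the paper's construction.
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=False)  # dictionary of directions

    def forward(self, x):
        codes = torch.relu(self.encoder(x))   # sparse non-negative coefficients
        recon = self.decoder(codes)           # weighted sum of dictionary directions
        return recon, codes

sae = TinySAE(d_model=512, n_features=4096)
x = torch.randn(8, 512)                       # stand-in for residual-stream activations
recon, codes = sae(x)
# Training would minimize reconstruction error plus an L1 sparsity penalty:
loss = (recon - x).pow(2).mean() + 1e-3 * codes.abs().mean()
```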

Mir Miroyan (@mirmiroyan):

We release Search Arena 🌐 — the first large-scale (24k+) dataset of in-the-wild user interactions with search-augmented LLMs.

We also share a comprehensive report on user preferences and model performance in the search-enabled setting.

Paper, dataset, and code in 🧵
Zifan (Sail) Wang (@_zifan_wang):

🧵 (1/6) Bringing together diverse mindsets – from in-the-trenches red teamers to ML & policy researchers, we write a position paper arguing crucial research priorities for red teaming frontier models, followed by a roadmap towards system-level safety, AI monitoring, and
Rico Angell (@rico_angell):

What causes jailbreaks to transfer between LLMs?

We find that jailbreak strength and model representation similarity predict transferability, and we can engineer model similarity to improve transfer.

Details in🧵
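
The representation-similarity idea can be sketched roughly as follows: embed the same prompts with both models and compare the structure of their pairwise similarities (an RSA-style correlation), which sidesteps mismatched hidden sizes. This is just one plausible metric, not necessarily the one used in the paper; the model names are placeholders.

```python
# One plausible way to quantify representation similarity between two models on
# shared prompts: compare the *structure* of their pairwise prompt similarities
# (an RSA-style correlation), which sidesteps mismatched hidden sizes.
# Not necessarily the paper's metric; model names are placeholders.
import torch
from transformers import AutoModel, AutoTokenizer

def pooled_states(model_name: str, prompts, device: str = "cpu") -> torch.Tensor:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()
    feats = []
    with torch.no_grad():
        for p in prompts:
            batch = tok(p, return_tensors="pt").to(device)
            hidden = model(**batch).last_hidden_state       # (1, seq_len, d_model)
            feats.append(hidden.mean(dim=1).squeeze(0))     # mean-pool over tokens
    return torch.stack(feats)                               # (n_prompts, d_model)

def representation_similarity(name_a: str, name_b: str, prompts) -> float:
    a, b = pooled_states(name_a, prompts), pooled_states(name_b, prompts)
    sim_a = torch.nn.functional.cosine_similarity(a.unsqueeze(1), a.unsqueeze(0), dim=-1)
    sim_b = torch.nn.functional.cosine_similarity(b.unsqueeze(1), b.unsqueeze(0), dim=-1)
    # Correlate the two similarity matrices: 1.0 means identical similarity structure.
    return torch.corrcoef(torch.stack([sim_a.flatten(), sim_b.flatten()]))[0, 1].item()
```
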
Paul Bogdan (@paulcbogdan):

New paper: What happens when an LLM reasons?

We created methods to interpret reasoning steps & their connections: resampling CoT, attention analysis, & suppressing attention.

We discover thought anchors: key steps shaping everything else.

Check our tool & unpack CoT yourself 🧵
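
One cheap way to build intuition for step-level importance (a toy counterfactual probe, not the paper's resampling procedure) is to drop each CoT step, resample final answers, and measure how much the answer distribution shifts. The `sample_answer` callable below is a hypothetical stand-in for whatever LLM interface you use.

```python
# Toy counterfactual probe (not the paper's exact resampling procedure):
# drop each CoT step, resample final answers, and measure how much the answer
# distribution shifts. `sample_answer` is a hypothetical stand-in for your LLM call.
from collections import Counter
from typing import Callable, List

def step_importance(question: str,
                    cot_steps: List[str],
                    sample_answer: Callable[[str], str],
                    n_samples: int = 20) -> List[float]:
    def answer_dist(steps: List[str]) -> Counter:
        prompt = question + "\n" + "\n".join(steps) + "\nAnswer:"
        return Counter(sample_answer(prompt) for _ in range(n_samples))

    base = answer_dist(cot_steps)
    scores = []
    for i in range(len(cot_steps)):
        ablated = answer_dist(cot_steps[:i] + cot_steps[i + 1:])
        keys = set(base) | set(ablated)
        # Total-variation distance between answer distributions with/without step i.
        tv = 0.5 * sum(abs(base[k] - ablated[k]) / n_samples for k in keys)
        scores.append(tv)
    return scores
```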

Fazl Barez (@fazlbarez):

Excited to share our paper: "Chain-of-Thought Is Not Explainability"! 

We unpack a critical misconception in AI: models explaining their Chain-of-Thought (CoT) steps aren't necessarily revealing their true reasoning. Spoiler: transparency of CoT can be an illusion. (1/9) 🧵
Valentina Pyatkin (@valentina__py):

💡Beyond math/code, instruction following with verifiable constraints is suitable to be learned with RLVR.
But the set of constraints and verifier functions is limited and most models overfit on IFEval.
We introduce IFBench to measure model generalization to unseen constraints.
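
For context, a "verifiable constraint" is just a programmatic check over the model's output, so the reward needs no learned judge. Below is a toy example in that spirit (not an actual IFBench verifier): two simple constraints and a binary RLVR-style reward.

```python
# Toy verifiable constraints and a binary RLVR-style reward, in the spirit of
# IFEval-type checks. These are illustrative only, not actual IFBench verifiers.
import re

def has_exact_bullets(response: str, n_bullets: int = 3) -> bool:
    """Constraint: response contains exactly n_bullets markdown bullet lines."""
    bullets = [ln for ln in response.splitlines() if re.match(r"^\s*[-*]\s+\S", ln)]
    return len(bullets) == n_bullets

def avoids_word(response: str, banned: str = "obviously") -> bool:
    """Constraint: response never uses a banned word."""
    return banned.lower() not in response.lower()

def verifiable_reward(response: str) -> float:
    """Reward is 1.0 only if every programmatic check passes."""
    return float(has_exact_bullets(response) and avoids_word(response))

print(verifiable_reward("- one\n- two\n- three"))  # 1.0
```
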
Dr. Karen Ullrich (@karen_ullrich):

How would you make an LLM "forget" the concept of dog — or any other arbitrary concept? 🐶❓

We introduce SAMD & SAMI — a novel, concept-agnostic approach to identify and manipulate attention modules in transformers.
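
Without claiming to reproduce SAMD/SAMI, the underlying idea can be sketched as: score each attention head by how strongly its output aligns with a concept direction, then damp or ablate the top-scoring heads via a forward hook. The tensor shapes and random activations below are stand-ins.

```python
# Generic sketch (not SAMD/SAMI itself): score attention heads by how strongly
# their cached outputs align with a concept direction, then pick the top heads
# to damp or ablate with a forward hook. Shapes and activations are stand-ins.
import torch

def score_heads(head_outputs: torch.Tensor, concept: torch.Tensor) -> torch.Tensor:
    """
    head_outputs: (n_layers, n_heads, n_tokens, d_model) cached per-head outputs
    concept:      (d_model,) direction representing the concept (e.g. "dog")
    Returns (n_layers, n_heads) saliency scores.
    """
    concept = concept / concept.norm()
    proj = torch.einsum("lhtd,d->lht", head_outputs, concept)  # per-token projection
    return proj.abs().mean(dim=-1)                             # average over tokens

def top_concept_heads(scores: torch.Tensor, k: int = 8):
    """Return (layer, head) indices of the k most concept-aligned heads."""
    n_heads = scores.shape[1]
    flat = scores.flatten().topk(k).indices
    return [(int(i) // n_heads, int(i) % n_heads) for i in flat]

# Example with random stand-in activations:
outputs = torch.randn(24, 16, 128, 512)
concept_dir = torch.randn(512)
print(top_concept_heads(score_heads(outputs, concept_dir)))
```
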
Kaiqu Liang (@kaiqu_liang):

🤔 Feel like your AI is bullshitting you? It’s not just you.

🚨 We quantified machine bullshit 💩

Turns out, aligning LLMs to be "helpful" via human feedback actually teaches them to bullshit—and Chain-of-Thought reasoning just makes it worse!

🔥 Time to rethink AI alignment.
Keyon Vafa (@keyonv):

Can an AI model predict perfectly and still have a terrible world model? What would that even mean?

Our new ICML paper formalizes these questions.

One result tells the story: A transformer trained on 10M solar systems nails planetary orbits. But it botches gravitational laws 🧵

Parmita Mishra (@prmshra):

I simply do not understand why no company other than openAI is very seriously focusing on memory/personalization. It's the main reason I use openAI.

What shocks me is that, barring grok (which has context of my tweets now), there's no other AI company that is even trying to