Xianjun Yang (@xianjun_agi)'s Twitter Profile
Xianjun Yang

@xianjun_agi

RS @AIatMeta. GenAI safety, data-centric AI. Previously PhD @ucsbnlp, BEng @tsinghua_uni. Opinions are my own.

All Watched Over by Machines of Loving Grace.

ID: 1224810594882113536

Link: https://xianjun-yang.github.io/
Joined: 04-02-2020 21:43:09

335 Tweets

886 Followers

1.1K Following

Prateek Yadav (@prateeky2806):

Ever wondered if model merging works at scale? Maybe the benefits wear off for bigger models?

Maybe you considered using model merging for post-training of your large model but not sure if it  generalizes well?

cc: Google AI Google DeepMind UNC NLP
🧵👇

Excited to announce my
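
For readers unfamiliar with the mechanics, here is a minimal sketch of the simplest form of model merging, plain parameter averaging of same-architecture checkpoints. It illustrates the general idea only, not the specific recipe or scale studied in the thread; the file names and uniform weights are placeholders.

```python
# Minimal sketch of weight-space model merging via simple parameter averaging.
# Generic illustration only, not the paper's merging recipe.
import torch

def average_merge(state_dicts, weights=None):
    """Merge same-architecture checkpoints by averaging their parameters."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
    return merged

# Usage (hypothetical same-architecture checkpoints):
# merged_sd = average_merge([torch.load("model_a.pt", map_location="cpu"),
#                            torch.load("model_b.pt", map_location="cpu")])
# model.load_state_dict(merged_sd)
```
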
Anthropic (@anthropicai):

New Anthropic research: Forecasting rare language model behaviors.

We forecast whether risks will occur after a model is deployed—using even very limited sets of test data.
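
As a back-of-the-envelope illustration of why deployment-scale forecasting matters (and not Anthropic's actual method), the toy calculation below extrapolates a per-query failure rate estimated on a small test set to the chance of seeing at least one failure across many deployment queries, under a naive independence assumption; the zero-failure case uses the "rule of three" upper bound.

```python
# Toy back-of-the-envelope (not the paper's method): extrapolate a per-query
# failure rate measured on a small test set to deployment scale, assuming
# queries are independent. With zero observed failures, use the "rule of three"
# approximate 95% upper bound on the per-query rate.

def deployment_risk(failures: int, n_tests: int, n_deploy_queries: int) -> float:
    """P(at least one failure across deployment queries)."""
    p = (3.0 / n_tests) if failures == 0 else (failures / n_tests)
    return 1.0 - (1.0 - p) ** n_deploy_queries

# Zero failures in 10,000 test queries still implies near-certain failure
# somewhere across 10 million deployment queries.
print(deployment_risk(0, 10_000, 10_000_000))  # ≈ 1.0
```
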
Sagnik Mukherjee (@saagnikkk):

🚨 Paper Alert: “RL Finetunes Small Subnetworks in Large Language Models”

From DeepSeek V3 Base to DeepSeek R1 Zero, a whopping 86% of parameters were NOT updated during RL training 😮😮
And this isn’t a one-off. The pattern holds across RL algorithms and models.
🧵A Deep Dive
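
A rough sketch of how one could reproduce that kind of statistic on their own checkpoints: compare a base and an RL-finetuned state dict entry by entry and report the fraction of parameter entries left (numerically) unchanged. File names and the tolerance below are placeholders.

```python
# Rough sketch: what fraction of parameters are identical between a base
# checkpoint and its RL-finetuned counterpart? (Loosely mirrors the 86% figure;
# file names and the tolerance are placeholders.)
import torch

def fraction_unchanged(base_sd, tuned_sd, atol: float = 0.0) -> float:
    same, total = 0, 0
    for name, base_param in base_sd.items():
        diff = (base_param - tuned_sd[name]).abs()
        same += diff.le(atol).sum().item()
        total += base_param.numel()
    return same / total

# base = torch.load("base.pt", map_location="cpu")
# tuned = torch.load("rl_tuned.pt", map_location="cpu")
# print(f"{fraction_unchanged(base, tuned):.1%} of parameter entries unchanged")
```
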
Sonia (@soniajoseph_):

Our paper Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video received an Oral at the Mechanistic Interpretability for Vision Workshop at CVPR 2025! 🎉

We’ll be in Nashville next week. Come say hi 👋

#CVPR2025 Mechanistic Interpretability for Vision @ CVPR2025
Ekdeep Singh Lubana (@ekdeepl):

🚨 New paper alert! Linear representation hypothesis (LRH) argues concepts are encoded as **sparse sum of orthogonal directions**, motivating interpretability tools like SAEs. But what if some concepts don’t fit that mold? Would SAEs capture them? 🤔 1/11
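
To make the assumption concrete: an SAE posits that an activation vector is approximately a sparse, non-negative combination of learned dictionary directions. The toy module below is a generic sketch of that setup (sizes and the L1 coefficient are arbitrary), not the construction analyzed in the paper.

```python
# Generic sketch of the assumption behind SAEs: an activation is approximately a
# sparse, non-negative combination of learned dictionary directions. Sizes and
# the L1 coefficient are arbitrary; this is not the paper's construction.
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=False)  # dictionary of directions

    def forward(self, x):
        codes = torch.relu(self.encoder(x))   # sparse non-negative coefficients
        recon = self.decoder(codes)           # weighted sum of dictionary directions
        return recon, codes

sae = TinySAE(d_model=512, n_features=4096)
x = torch.randn(8, 512)                       # stand-in for residual-stream activations
recon, codes = sae(x)
# Training would minimize reconstruction error plus an L1 sparsity penalty:
loss = (recon - x).pow(2).mean() + 1e-3 * codes.abs().mean()
```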

Mir Miroyan (@mirmiroyan):

We release Search Arena 🌐 — the first large-scale (24k+) dataset of in-the-wild user interactions with search-augmented LLMs.

We also share a comprehensive report on user preferences and model performance in the search-enabled setting.

Paper, dataset, and code in 🧵
Zifan (Sail) Wang (@_zifan_wang):

🧵 (1/6) Bringing together diverse mindsets – from in-the-trenches red teamers to ML & policy researchers, we write a position paper arguing crucial research priorities for red teaming frontier models, followed by a roadmap towards system-level safety, AI monitoring, and
Rico Angell (@rico_angell):

What causes jailbreaks to transfer between LLMs?

We find that jailbreak strength and model representation similarity predict transferability, and we can engineer model similarity to improve transfer.

Details in🧵
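
The representation-similarity idea can be sketched roughly as follows: embed the same prompts with both models and compare the structure of their pairwise similarities (an RSA-style correlation), which sidesteps mismatched hidden sizes. This is just one plausible metric, not necessarily the one used in the paper; the model names are placeholders.

```python
# One plausible way to quantify representation similarity between two models on
# shared prompts: compare the *structure* of their pairwise prompt similarities
# (an RSA-style correlation), which sidesteps mismatched hidden sizes.
# Not necessarily the paper's metric; model names are placeholders.
import torch
from transformers import AutoModel, AutoTokenizer

def pooled_states(model_name: str, prompts, device: str = "cpu") -> torch.Tensor:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()
    feats = []
    with torch.no_grad():
        for p in prompts:
            batch = tok(p, return_tensors="pt").to(device)
            hidden = model(**batch).last_hidden_state       # (1, seq_len, d_model)
            feats.append(hidden.mean(dim=1).squeeze(0))     # mean-pool over tokens
    return torch.stack(feats)                               # (n_prompts, d_model)

def representation_similarity(name_a: str, name_b: str, prompts) -> float:
    a, b = pooled_states(name_a, prompts), pooled_states(name_b, prompts)
    sim_a = torch.nn.functional.cosine_similarity(a.unsqueeze(1), a.unsqueeze(0), dim=-1)
    sim_b = torch.nn.functional.cosine_similarity(b.unsqueeze(1), b.unsqueeze(0), dim=-1)
    # Correlate the two similarity matrices: 1.0 means identical similarity structure.
    return torch.corrcoef(torch.stack([sim_a.flatten(), sim_b.flatten()]))[0, 1].item()
```
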
Paul Bogdan (@paulcbogdan):

New paper: What happens when an LLM reasons?

We created methods to interpret reasoning steps & their connections: resampling CoT, attention analysis, & suppressing attention.

We discover thought anchors: key steps shaping everything else.

Check our tool & unpack CoT yourself 🧵
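
One cheap way to build intuition for step-level importance (a toy counterfactual probe, not the paper's resampling procedure) is to drop each CoT step, resample final answers, and measure how much the answer distribution shifts. The `sample_answer` callable below is a hypothetical stand-in for whatever LLM interface you use.

```python
# Toy counterfactual probe (not the paper's exact resampling procedure):
# drop each CoT step, resample final answers, and measure how much the answer
# distribution shifts. `sample_answer` is a hypothetical stand-in for your LLM call.
from collections import Counter
from typing import Callable, List

def step_importance(question: str,
                    cot_steps: List[str],
                    sample_answer: Callable[[str], str],
                    n_samples: int = 20) -> List[float]:
    def answer_dist(steps: List[str]) -> Counter:
        prompt = question + "\n" + "\n".join(steps) + "\nAnswer:"
        return Counter(sample_answer(prompt) for _ in range(n_samples))

    base = answer_dist(cot_steps)
    scores = []
    for i in range(len(cot_steps)):
        ablated = answer_dist(cot_steps[:i] + cot_steps[i + 1:])
        keys = set(base) | set(ablated)
        # Total-variation distance between answer distributions with/without step i.
        tv = 0.5 * sum(abs(base[k] - ablated[k]) / n_samples for k in keys)
        scores.append(tv)
    return scores
```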

Fazl Barez (@fazlbarez):

Excited to share our paper: "Chain-of-Thought Is Not Explainability"! 

We unpack a critical misconception in AI: models explaining their Chain-of-Thought (CoT) steps aren't necessarily revealing their true reasoning. Spoiler: transparency of CoT can be an illusion. (1/9) 🧵
Valentina Pyatkin (@valentina__py):

💡Beyond math/code, instruction following with verifiable constraints is suitable to be learned with RLVR.
But the set of constraints and verifier functions is limited and most models overfit on IFEval.
We introduce IFBench to measure model generalization to unseen constraints.
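
For context, a "verifiable constraint" is just a programmatic check over the model's output, so the reward needs no learned judge. Below is a toy example in that spirit (not an actual IFBench verifier): two simple constraints and a binary RLVR-style reward.

```python
# Toy verifiable constraints and a binary RLVR-style reward, in the spirit of
# IFEval-type checks. These are illustrative only, not actual IFBench verifiers.
import re

def has_exact_bullets(response: str, n_bullets: int = 3) -> bool:
    """Constraint: response contains exactly n_bullets markdown bullet lines."""
    bullets = [ln for ln in response.splitlines() if re.match(r"^\s*[-*]\s+\S", ln)]
    return len(bullets) == n_bullets

def avoids_word(response: str, banned: str = "obviously") -> bool:
    """Constraint: response never uses a banned word."""
    return banned.lower() not in response.lower()

def verifiable_reward(response: str) -> float:
    """Reward is 1.0 only if every programmatic check passes."""
    return float(has_exact_bullets(response) and avoids_word(response))

print(verifiable_reward("- one\n- two\n- three"))  # 1.0
```
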
Dr. Karen Ullrich (@karen_ullrich):

How would you make an LLM "forget" the concept of dog — or any other arbitrary concept? 🐶❓

We introduce SAMD & SAMI — a novel, concept-agnostic approach to identify and manipulate attention modules in transformers.
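
Without claiming to reproduce SAMD/SAMI, the underlying idea can be sketched as: score each attention head by how strongly its output aligns with a concept direction, then damp or ablate the top-scoring heads via a forward hook. The tensor shapes and random activations below are stand-ins.

```python
# Generic sketch (not SAMD/SAMI itself): score attention heads by how strongly
# their cached outputs align with a concept direction, then pick the top heads
# to damp or ablate with a forward hook. Shapes and activations are stand-ins.
import torch

def score_heads(head_outputs: torch.Tensor, concept: torch.Tensor) -> torch.Tensor:
    """
    head_outputs: (n_layers, n_heads, n_tokens, d_model) cached per-head outputs
    concept:      (d_model,) direction representing the concept (e.g. "dog")
    Returns (n_layers, n_heads) saliency scores.
    """
    concept = concept / concept.norm()
    proj = torch.einsum("lhtd,d->lht", head_outputs, concept)  # per-token projection
    return proj.abs().mean(dim=-1)                             # average over tokens

def top_concept_heads(scores: torch.Tensor, k: int = 8):
    """Return (layer, head) indices of the k most concept-aligned heads."""
    n_heads = scores.shape[1]
    flat = scores.flatten().topk(k).indices
    return [(int(i) // n_heads, int(i) % n_heads) for i in flat]

# Example with random stand-in activations:
outputs = torch.randn(24, 16, 128, 512)
concept_dir = torch.randn(512)
print(top_concept_heads(score_heads(outputs, concept_dir)))
```
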
Kaiqu Liang (@kaiqu_liang):

🤔 Feel like your AI is bullshitting you? It’s not just you.

🚨 We quantified machine bullshit 💩

Turns out, aligning LLMs to be "helpful" via human feedback actually teaches them to bullshit—and Chain-of-Thought reasoning just makes it worse!

🔥 Time to rethink AI alignment.
Keyon Vafa (@keyonv):

Can an AI model predict perfectly and still have a terrible world model? What would that even mean?

Our new ICML paper formalizes these questions.

One result tells the story: A transformer trained on 10M solar systems nails planetary orbits. But it botches gravitational laws 🧵

Parmita Mishra (@prmshra):

I simply do not understand why no company other than openAI is very seriously focusing on memory/personalization. It's the main reason I use openAI.

What shocks me is that, barring grok (which has context of my tweets now), there's no other AI company that is even trying to