Yi Zeng 曾祎 (@easonzeng623) 's Twitter Profile
Yi Zeng 曾祎

@easonzeng623

probe to improve @VirtueAI_co | Ph.D. @VTEngineering | Amazon Research Fellow | #AI_safety 🦺 #AI_security 🛡 | I deal with the dark side of machine learning.

ID: 901961911448723456

Link: https://www.yi-zeng.com | Joined: 28-08-2017 00:17:32

563 Tweets

1.1K Followers

1.1K Following

Jia-Bin Huang (@jbhuang0604) 's Twitter Profile Photo

Wow!! 🤯🤯🤯 Openings of *30* tenured and/or tenure-track faculty positions in Artificial Intelligence! ejobs.umd.edu/postings/124144

Yujin Potter (@yujink_) 's Twitter Profile Photo

How will LLMs reshape our democracy? Recent work including ours has started exploring this important question. We recently wrote a blog (future-of-democracy-with-llm.org) regarding this! (summary in thread) Dawn Song, David G. Rand @dgrand.bsky.social, Yejin Choi 1/N

𝕏un (@xun_aq) 's Twitter Profile Photo

Code agents are great, but not risk-free in code execution and generation! 🎯 We propose RedCode, an evaluation platform to comprehensively evaluate code agents in terms of risky code execution and generation. 📅 Catch our #NeurIPS2024 poster session tomorrow (12/12) afternoon

𝕏un (@xun_aq) 's Twitter Profile Photo

For more details, please visit our paper at arxiv.org/abs/2411.07781. I'm Xun Liu, a senior undergraduate student advised by Prof. Bo Li. Thanks to our great collaborators Chengquan Guo, Chulin Xie, Andy @andyz245, Yi Zeng 曾祎, Zinan

Luke Bailey (@lukebailey181) 's Twitter Profile Photo

Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
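
For context, the "probes" referenced here are typically linear classifiers trained on a model's hidden activations. A minimal sketch of such a latent-space harmfulness probe (hypothetical data and dimensions, not the paper's actual setup) might look like:

```python
# Minimal sketch of a latent-space "harmfulness probe": a logistic-regression
# classifier trained on hidden activations. Data and dimensions are invented
# for illustration only; this is not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

d_model = 2048                     # hidden size (assumed)
rng = np.random.default_rng(0)

# Stand-in activations: one vector per prompt, labeled harmful (1) / benign (0)
acts = rng.normal(size=(1000, d_model))
labels = rng.integers(0, 2, size=1000)

probe = LogisticRegression(max_iter=1000).fit(acts, labels)

# At inference time, the defense flags a prompt if the probe fires on its activation
new_act = rng.normal(size=(1, d_model))
print("harmfulness score:", probe.predict_proba(new_act)[0, 1])
```

The attack described in the thread operates in this latent space: activations are reshaped so the probe's output changes while the model's downstream behavior is preserved.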

Anthropic (@anthropicai) 's Twitter Profile Photo

New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.

Rafael Rafailov @ NeurIPS (@rm_rafailov) 's Twitter Profile Photo

We have a new position paper on "inference time compute" and what we have been working on in the last few months! We present some theory on why it is necessary, how it works, why we need it, and what it means for "super" intelligence.

Aran Komatsuzaki (@arankomatsuzaki) 's Twitter Profile Photo

Open Problems in Mechanistic Interpretability

This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.

Lee Sharkey (@leedsharkey) 's Twitter Profile Photo

Big new review! 🟦Open Problems in Mechanistic Interpretability🟦 We bring together perspectives from ~30 top researchers to outline the current frontiers of mech interp. It highlights the open problems that we think the field should prioritize! 🧵

Tomek Korbak (@tomekkorbak) 's Twitter Profile Photo

🧵 What safety measures prevent a misaligned LLM agent from causing a catastrophe? How do we make a safety case demonstrating that these measures are sufficient? Our new paper from @AISafetyInst and Redwood Research sketches a part of an AI control safety case in detail, proposing an

Aryaman Arora (@aryaman2020) 's Twitter Profile Photo

new paper! 🫡

we introduce 🪓AxBench, a scalable benchmark that evaluates interpretability techniques on two axes: concept detection and model steering.

we find that:
🥇prompting and finetuning are still best
🥈supervised interp methods are effective
😮SAEs lag behind

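As a rough illustration of the two axes (my own toy sketch, not AxBench's actual protocol): concept detection asks whether a direction in activation space separates concept-positive from concept-negative inputs, and steering asks whether adding that direction shifts the representation the model acts on.

```python
# Toy sketch of the two AxBench axes using a single concept direction.
# Everything here (dimensions, synthetic data, how the direction is obtained)
# is illustrative, not the benchmark's implementation.
import numpy as np

rng = np.random.default_rng(0)
d = 512

# Fake activations: concept-positive examples are shifted along a hidden direction
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
pos = rng.normal(size=(100, d)) + 2.0 * true_dir
neg = rng.normal(size=(100, d))

# Concept detection: score inputs by their projection onto an estimated direction
est_dir = pos.mean(0) - neg.mean(0)
est_dir /= np.linalg.norm(est_dir)
scores_pos, scores_neg = pos @ est_dir, neg @ est_dir
acc = ((scores_pos > 1.0).mean() + (scores_neg <= 1.0).mean()) / 2
print(f"concept detection accuracy: {acc:.2f}")

# Model steering: add the (scaled) direction to an activation to push behavior
h = rng.normal(size=d)
h_steered = h + 4.0 * est_dir
print("concept projection before/after steering:", h @ est_dir, h_steered @ est_dir)
```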
Aleksander Madry (@aleks_madry) 's Twitter Profile Photo

Do current LLMs perform simple tasks (e.g., grade school math) reliably? We know they don't (is 9.9 larger than 9.11?), but why? Turns out that, for one reason, benchmarks are too noisy to pinpoint such lingering failures. w/ Josh Vendrow Eddie Vendrow Sara Beery 1/5

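To make the "too noisy" point concrete (a back-of-the-envelope calculation of my own, not the paper's analysis): accuracy measured on a few hundred questions carries sampling error large enough to hide a small set of consistent failures.

```python
# Back-of-the-envelope: sampling noise in a benchmark accuracy estimate.
# Numbers are illustrative, not taken from the paper.
import math

n_questions = 200          # benchmark size (assumed)
true_acc = 0.90            # model's true per-question success rate (assumed)

# Standard error of the measured accuracy under a binomial model
se = math.sqrt(true_acc * (1 - true_acc) / n_questions)
print(f"accuracy {true_acc:.0%} +/- {1.96 * se:.1%} (95% CI)")
# ~90% +/- 4.2%: a handful of questions the model reliably gets wrong
# (e.g. "is 9.9 larger than 9.11?") is easily lost inside this noise band.
```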
Yihe Deng (@yihe__deng) 's Twitter Profile Photo

New paper & model release!

Excited to introduce DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails, showcasing our new DuoGuard-0.5B model.

- Model: huggingface.co/DuoGuard/DuoGu…
- Paper: arxiv.org/abs/2502.05163
- GitHub: github.com/yihedeng9/DuoG…

Grounded in a

Dylan Sam (@dylanjsam) 's Twitter Profile Photo

Excited to share new work from my internship at Google AI! Curious how we should measure the similarity between examples in pretraining datasets? We study the role of similarity in pretraining 1.7B parameter language models on the Pile. arxiv: arxiv.org/abs/2502.02494 1/🧵

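One common way to operationalize "similarity between examples" is cosine similarity between text embeddings. A generic sketch under my own assumptions (placeholder encoder and examples; the paper studies similarity notions in much more depth):

```python
# Generic sketch: cosine similarity between pretraining examples via sentence
# embeddings. The embedding model and example texts are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small off-the-shelf encoder

examples = [
    "def add(a, b): return a + b",
    "def sum_two(x, y): return x + y",
    "The mitochondria is the powerhouse of the cell.",
]
emb = model.encode(examples, normalize_embeddings=True)

# With unit-normalized embeddings, cosine similarity is just a dot product
sim = emb @ emb.T
print(np.round(sim, 2))  # the near-duplicate code snippets score much higher
```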
Josh Engels (@joshaengels) 's Twitter Profile Photo

1/14: If sparse autoencoders work, they should give us interpretable classifiers that help with probing in difficult regimes (e.g. data scarcity). But we find that SAE probes consistently underperform! Our takeaway: mech interp should use stronger baselines to measure progress 🧵

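The "stronger baselines" in question are typically simple supervised probes trained directly on raw activations. A minimal sketch of that comparison (the "SAE" below is just a random ReLU encoder to show the pipeline shape; data, dimensions, and the encoder are all invented for illustration):

```python
# Sketch: a probe on raw activations (the baseline) vs. a probe on SAE-style
# features. Purely illustrative; not the paper's models, data, or SAEs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, d_sae, n = 1024, 4096, 400          # small n mimics the data-scarce regime

acts = rng.normal(size=(n, d_model))         # stand-in for cached activations
labels = rng.integers(0, 2, size=n)          # stand-in for the probing target

W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
sae_feats = np.maximum(acts @ W_enc, 0.0)    # ReLU encoder, mimicking SAE features

for name, X in [("raw-activation probe", acts), ("SAE-feature probe", sae_feats)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25, random_state=0)
    clf = LogisticRegression(C=0.1, max_iter=2000).fit(X_tr, y_tr)
    print(name, "accuracy:", round(clf.score(X_te, y_te), 3))
```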
Dawn Song (@dawnsongtweets) 's Twitter Profile Photo

🚀 Really excited to launch the #AgentX competition hosted by UC Berkeley RDI, UC Berkeley, alongside our LLM Agents MOOC series (a global community of 22k+ learners & growing fast). Whether you're building the next disruptive AI startup or pushing the research frontier, AgentX is your

Yi Zeng 曾祎 (@easonzeng623) 's Twitter Profile Photo

AIR-Bench is a Spotlight at ICLR 2025! Catch our poster on Fri, Apr 26, 10 a.m.–12:30 p.m. SGT (Poster Session 5). Sadly, I won’t be there in person (visa woes, again), but the insights—and our incredible team—will be with you in Singapore. Go say hi 👋
