Yi Zeng 曾祎 (@easonzeng623) 's Twitter Profile
Yi Zeng 曾祎

@easonzeng623

probe to improve @VirtueAI_co | Ph.D. @VTEngineering | Amazon Research Fellow | #AI_safety 🦺 #AI_security 🛡 | I deal with the dark side of machine learning.

ID: 901961911448723456

Link: https://www.yi-zeng.com | Joined: 28-08-2017 00:17:32

563 Tweets

1.1K Followers

1.1K Following

Jia-Bin Huang (@jbhuang0604) 's Twitter Profile Photo

Wow!! 🤯🤯🤯 Openings of *30* tenured and/or tenure-track faculty positions in Artificial Intelligence! ejobs.umd.edu/postings/124144

Yujin Potter (@yujink_) 's Twitter Profile Photo

How will LLMs reshape our democracy? Recent work including ours has started exploring this important question. We recently wrote a blog (future-of-democracy-with-llm.org) regarding this! (summary in thread) Dawn Song, David G. Rand @dgrand.bsky.social, Yejin Choi 1/N

𝕏un (@xun_aq) 's Twitter Profile Photo

Code agents are great, but not risk-free in code execution and generation! 🎯 We propose RedCode, an evaluation platform to comprehensively evaluate code agents in terms of risky code execution and generation. 📅 Catch our #NeurIPS2024 poster session tomorrow (12/12) afternoon

𝕏un (@xun_aq) 's Twitter Profile Photo

For more details, please visit our paper at arxiv.org/abs/2411.07781. I'm Xun Liu, a senior undergraduate student advised by Prof. Bo Li. Thanks to our great collaborators Chengquan Guo, Chulin Xie, Andy @andyz245, Yi Zeng 曾祎, Zinan

Luke Bailey (@lukebailey181) 's Twitter Profile Photo

Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
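
For context, the "probes" referenced here are typically linear classifiers trained on a model's hidden activations. A minimal sketch of such a latent-space harmfulness probe (hypothetical data and dimensions, not the paper's actual setup) might look like:

```python
# Minimal sketch of a latent-space "harmfulness probe": a logistic-regression
# classifier trained on hidden activations. Data and dimensions are invented
# for illustration only; this is not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

d_model = 2048                     # hidden size (assumed)
rng = np.random.default_rng(0)

# Stand-in activations: one vector per prompt, labeled harmful (1) / benign (0)
acts = rng.normal(size=(1000, d_model))
labels = rng.integers(0, 2, size=1000)

probe = LogisticRegression(max_iter=1000).fit(acts, labels)

# At inference time, the defense flags a prompt if the probe fires on its activation
new_act = rng.normal(size=(1, d_model))
print("harmfulness score:", probe.predict_proba(new_act)[0, 1])
```

The attack described in the thread operates in this latent space: activations are reshaped so the probe's output changes while the model's downstream behavior is preserved.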

Anthropic (@anthropicai) 's Twitter Profile Photo

New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.

Rafael Rafailov @ NeurIPS (@rm_rafailov) 's Twitter Profile Photo

We have a new position paper on "inference time compute" and what we have been working on in the last few months! We present some theory on why it is necessary, how it works, why we need it, and what it means for "super" intelligence.

Aran Komatsuzaki (@arankomatsuzaki) 's Twitter Profile Photo

Open Problems in Mechanistic Interpretability

This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.

Lee Sharkey (@leedsharkey) 's Twitter Profile Photo

Big new review! 🟦Open Problems in Mechanistic Interpretability🟦 We bring together perspectives from ~30 top researchers to outline the current frontiers of mech interp. It highlights the open problems that we think the field should prioritize! 🧵

Tomek Korbak (@tomekkorbak) 's Twitter Profile Photo

🧵 What safety measures prevent a misaligned LLM agent from causing a catastrophe? How do we make a safety case demonstrating that these measures are sufficient? Our new paper from @AISafetyInst and Redwood Research sketches a part of an AI control safety case in detail, proposing an

Aryaman Arora (@aryaman2020) 's Twitter Profile Photo

new paper! 🫡

we introduce 🪓AxBench, a scalable benchmark that evaluates interpretability techniques on two axes: concept detection and model steering.

we find that:
🥇prompting and finetuning are still best
🥈supervised interp methods are effective
😮SAEs lag behind

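As a rough illustration of the two axes (my own toy sketch, not AxBench's actual protocol): concept detection asks whether a direction in activation space separates concept-positive from concept-negative inputs, and steering asks whether adding that direction shifts the representation the model acts on.

```python
# Toy sketch of the two AxBench axes using a single concept direction.
# Everything here (dimensions, synthetic data, how the direction is obtained)
# is illustrative, not the benchmark's implementation.
import numpy as np

rng = np.random.default_rng(0)
d = 512

# Fake activations: concept-positive examples are shifted along a hidden direction
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
pos = rng.normal(size=(100, d)) + 2.0 * true_dir
neg = rng.normal(size=(100, d))

# Concept detection: score inputs by their projection onto an estimated direction
est_dir = pos.mean(0) - neg.mean(0)
est_dir /= np.linalg.norm(est_dir)
scores_pos, scores_neg = pos @ est_dir, neg @ est_dir
acc = ((scores_pos > 1.0).mean() + (scores_neg <= 1.0).mean()) / 2
print(f"concept detection accuracy: {acc:.2f}")

# Model steering: add the (scaled) direction to an activation to push behavior
h = rng.normal(size=d)
h_steered = h + 4.0 * est_dir
print("concept projection before/after steering:", h @ est_dir, h_steered @ est_dir)
```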
Aleksander Madry (@aleks_madry) 's Twitter Profile Photo

Do current LLMs perform simple tasks (e.g., grade school math) reliably? We know they don't (is 9.9 larger than 9.11?), but why? Turns out that, for one reason, benchmarks are too noisy to pinpoint such lingering failures. w/ Josh Vendrow Eddie Vendrow Sara Beery 1/5

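To make the "too noisy" point concrete (a back-of-the-envelope calculation of my own, not the paper's analysis): accuracy measured on a few hundred questions carries sampling error large enough to hide a small set of consistent failures.

```python
# Back-of-the-envelope: sampling noise in a benchmark accuracy estimate.
# Numbers are illustrative, not taken from the paper.
import math

n_questions = 200          # benchmark size (assumed)
true_acc = 0.90            # model's true per-question success rate (assumed)

# Standard error of the measured accuracy under a binomial model
se = math.sqrt(true_acc * (1 - true_acc) / n_questions)
print(f"accuracy {true_acc:.0%} +/- {1.96 * se:.1%} (95% CI)")
# ~90% +/- 4.2%: a handful of questions the model reliably gets wrong
# (e.g. "is 9.9 larger than 9.11?") is easily lost inside this noise band.
```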
Yihe Deng (@yihe__deng) 's Twitter Profile Photo

New paper & model release!

Excited to introduce DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails, showcasing our new DuoGuard-0.5B model.

- Model: huggingface.co/DuoGuard/DuoGu…
- Paper: arxiv.org/abs/2502.05163
- GitHub: github.com/yihedeng9/DuoG…

Grounded in a

Dylan Sam (@dylanjsam) 's Twitter Profile Photo

Excited to share new work from my internship at Google AI! Curious how we should measure the similarity between examples in pretraining datasets? We study the role of similarity in pretraining 1.7B parameter language models on the Pile. arxiv: arxiv.org/abs/2502.02494 1/🧵

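One common way to operationalize "similarity between examples" is cosine similarity between text embeddings. A generic sketch under my own assumptions (placeholder encoder and examples; the paper studies similarity notions in much more depth):

```python
# Generic sketch: cosine similarity between pretraining examples via sentence
# embeddings. The embedding model and example texts are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small off-the-shelf encoder

examples = [
    "def add(a, b): return a + b",
    "def sum_two(x, y): return x + y",
    "The mitochondria is the powerhouse of the cell.",
]
emb = model.encode(examples, normalize_embeddings=True)

# With unit-normalized embeddings, cosine similarity is just a dot product
sim = emb @ emb.T
print(np.round(sim, 2))  # the near-duplicate code snippets score much higher
```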
Josh Engels (@joshaengels) 's Twitter Profile Photo

1/14: If sparse autoencoders work, they should give us interpretable classifiers that help with probing in difficult regimes (e.g. data scarcity). But we find that SAE probes consistently underperform! Our takeaway: mech interp should use stronger baselines to measure progress 🧵

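The "stronger baselines" in question are typically simple supervised probes trained directly on raw activations. A minimal sketch of that comparison (the "SAE" below is just a random ReLU encoder to show the pipeline shape; data, dimensions, and the encoder are all invented for illustration):

```python
# Sketch: a probe on raw activations (the baseline) vs. a probe on SAE-style
# features. Purely illustrative; not the paper's models, data, or SAEs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, d_sae, n = 1024, 4096, 400          # small n mimics the data-scarce regime

acts = rng.normal(size=(n, d_model))         # stand-in for cached activations
labels = rng.integers(0, 2, size=n)          # stand-in for the probing target

W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
sae_feats = np.maximum(acts @ W_enc, 0.0)    # ReLU encoder, mimicking SAE features

for name, X in [("raw-activation probe", acts), ("SAE-feature probe", sae_feats)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25, random_state=0)
    clf = LogisticRegression(C=0.1, max_iter=2000).fit(X_tr, y_tr)
    print(name, "accuracy:", round(clf.score(X_te, y_te), 3))
```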
Dawn Song (@dawnsongtweets) 's Twitter Profile Photo

🚀 Really excited to launch the #AgentX competition hosted by UC Berkeley RDI, UC Berkeley, alongside our LLM Agents MOOC series (a global community of 22k+ learners & growing fast). Whether you're building the next disruptive AI startup or pushing the research frontier, AgentX is your

Yi Zeng 曾祎 (@easonzeng623) 's Twitter Profile Photo

AIR-Bench is a Spotlight at ICLR 2025! Catch our poster on Fri, Apr 26, 10 a.m.–12:30 p.m. SGT (Poster Session 5). Sadly, I won’t be there in person (visa woes, again), but the insights—and our incredible team—will be with you in Singapore. Go say hi 👋
