Ben Zhou (@benzhou96)'s Twitter Profile
Ben Zhou

@benzhou96

Assistant Professor @SCAI_ASU. Also known as Xuanyu Zhou.

ID: 1033928440645382149

Link: http://xuanyu.me · Joined: 27-08-2018 04:05:24

27 Tweets

310 Followers

248 Following

Nan Xu (@xunannancy)'s Twitter Profile Photo

🚨New paper, Cognitive Overload to #jailbreak #LLM, accepted by #NAACL2024 Findings

LLMs are vulnerable to our black-box attacks:
😞multilingual cognitive overload
☹️ veiled expression
😣effect-to-cause reasoning

Existing defense strategies🤯...

arxiv.org/abs/2311.09827
Bangzheng Li (@bangzhengl)'s Twitter Profile Photo

During a long-chained reasoning process...
🤔Are LLMs following correct reasoning paths or relying on semantic shortcuts?
😵‍💫 What contributes to the failure of LLMs?
🛒Will prompting/RAG solve it?

Check out our paper, which is accepted by #NAACL2024 at: vincentleebang.github.io/eureqa.github.…
Xingyu Fu (@xingyufu2)'s Twitter Profile Photo

Can GPT-4V and Gemini-Pro perceive the world the way humans do? 🤔

Can they solve the vision tasks that humans can in the blink of an eye? 😉

tldr; NO, they are far worse than us 💁🏻‍♀️

Introducing BLINK👁 zeyofu.github.io/blink/, a novel benchmark that studies visual perception
Xingyu Fu (@xingyufu2)'s Twitter Profile Photo

Can Text-to-Image models understand common sense? 🤔

Can they generate images that fit everyday common sense? 🤔

tldr; NO, they are far less intelligent than us 💁🏻‍♀️

Introducing Commonsense-T2I 💡 zeyofu.github.io/CommonsenseT2I/, a novel evaluation and benchmark designed to measure
Zhikun Xu (@jerrrykun)'s Twitter Profile Photo

🚀 Exciting news! Our paper, “ToW: Thoughts of Words Improve Reasoning in Large Language Models”, got accepted at NAACL 2025 Main Conference🎉! #NAACL #NAACL2025 💡What if LMs didn’t just predict words—but actually reason about them? That’s where ToW comes in. 🧵👇

Ben Zhou (@benzhou96)'s Twitter Profile Photo

Excited to see my first student publication since I took the faculty position at ASU. I truly believe this work's general direction, which encourages trial-and-error generalization by removing essential information, will make a more practical impact soon.

Zhikun Xu (@jerrrykun)'s Twitter Profile Photo

🚨 LLMs can generate math proofs and solve competition-level math problems, but do they truly understand them? 🚀Introducing CounterMATH, a benchmark for assessing LLMs’ mathematical reasoning via counterexample-based proofs. 📑Paper link: arxiv.org/abs/2502.10454 🧵Read on!⬇️

Yu Feng (@anniefeng6)'s Twitter Profile Photo

#ICLR2025 Oral

LLMs often struggle with reliable and consistent decisions under uncertainty 😵‍💫 — largely because they can't reliably estimate the probability of each choice.

We propose BIRD, a framework that significantly enhances LLM decision making under uncertainty.

BIRD
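BIRD's actual construction is in the linked paper; as a minimal illustration of the underlying problem the thread names (estimating the probability of each choice), here is a naive Monte-Carlo baseline that samples a stochastic model repeatedly and counts outcomes. The `sample_choice` stub and its bias weights are hypothetical stand-ins for a real LLM call, not part of BIRD.

```python
import random
from collections import Counter

def sample_choice(prompt, choices, rng):
    """Stub standing in for one stochastic LLM call; swap in a real API.
    Hypothetical behavior: the model is biased 3:1 toward the first choice."""
    weights = [3] + [1] * (len(choices) - 1)
    return rng.choices(choices, weights=weights, k=1)[0]

def estimate_choice_probabilities(prompt, choices, n_samples=1000, seed=0):
    """Naive Monte-Carlo estimate of P(choice) from repeated samples."""
    rng = random.Random(seed)
    counts = Counter(sample_choice(prompt, choices, rng) for _ in range(n_samples))
    return {c: counts[c] / n_samples for c in choices}

probs = estimate_choice_probabilities("Should I take an umbrella?", ["yes", "no"])
```

Sampling like this is expensive and noisy, which is part of why a structured framework over the raw model is attractive.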
jakedineenasu (@jakedineenasu)'s Twitter Profile Photo

๐Ÿ” Introducing QA-LIGN: A reflective alignment approach using a draftโ†’reflectionโ†’revision pipeline. We create symbolic reward models that serve as both natural language critics & general reward models, bridging rule-based rewards and RLAIF. ๐Ÿ“„ Paper: arxiv.org/pdf/2506.08123

๐Ÿ” Introducing QA-LIGN: A reflective alignment approach using a draftโ†’reflectionโ†’revision pipeline. We create symbolic reward models that serve as both natural language critics & general reward models, bridging rule-based rewards and RLAIF.

๐Ÿ“„ Paper: arxiv.org/pdf/2506.08123
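QA-LIGN's symbolic reward models are more involved (see the paper), but the draft→reflection→revision control flow it names can be sketched in a few lines. `generate` below is a hypothetical stub in place of a real LLM API, with canned responses purely for illustration.

```python
def generate(prompt):
    """Stub LLM call; replace with a real model API.
    Canned, hypothetical outputs keyed on the prompt's instruction word."""
    if "Revise" in prompt:
        return "Revised answer that addresses the critique."
    if "Critique" in prompt:
        return "The draft is too terse; add justification."
    return "Draft answer."

def draft_reflect_revise(question):
    """One pass of a draft -> reflection -> revision pipeline."""
    draft = generate(question)
    critique = generate(f"Critique the following answer to '{question}':\n{draft}")
    revision = generate(
        f"Revise the answer to '{question}' given this critique:\n{critique}\n"
        f"Draft:\n{draft}"
    )
    return {"draft": draft, "critique": critique, "revision": revision}

result = draft_reflect_revise("What is alignment?")
```

The loop could be iterated until the critique raises no further issues; the single pass here just shows the data flow.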
Aswin RRV (@aswinrrv)'s Twitter Profile Photo

๐ŸŽ‰๐„๐ฑ๐œ๐ข๐ญ๐ž๐ ๐ญ๐จ ๐ฌ๐ก๐š๐ซ๐ž ๐ญ๐ก๐š๐ญ ๐จ๐ฎ๐ซ ๐ง๐ž๐ฐ ๐ฉ๐š๐ฉ๐ž๐ซ, "ThinkTuning", ๐ข๐ฌ ๐ง๐จ๐ฐ ๐จ๐ฎ๐ญ! ๐ŸŽ‰ ๐Ÿ”„RL merely draws out behaviors already present in the base models. Sophisticated thinking behaviors like self-reflection, self-correction and other multi-step reasoning

๐ŸŽ‰๐„๐ฑ๐œ๐ข๐ญ๐ž๐ ๐ญ๐จ ๐ฌ๐ก๐š๐ซ๐ž ๐ญ๐ก๐š๐ญ ๐จ๐ฎ๐ซ ๐ง๐ž๐ฐ ๐ฉ๐š๐ฉ๐ž๐ซ, "ThinkTuning", ๐ข๐ฌ ๐ง๐จ๐ฐ ๐จ๐ฎ๐ญ! ๐ŸŽ‰

๐Ÿ”„RL merely draws out behaviors already present in the base models. Sophisticated thinking behaviors like self-reflection, self-correction and other multi-step reasoning
Ben Zhou (@benzhou96)'s Twitter Profile Photo

Now accepted to #EMNLP2025! Check out our work on instilling "thinking behaviors" in "non-thinking models" without any distillation from thinking models!

Dongwon Jung (@dong_w0n)'s Twitter Profile Photo

Excited to share that two of my first-author papers were accepted to #EMNLP2025! ✨📚

1️⃣ Code Execution as Grounded Supervision for LLM Reasoning (Main)
2️⃣ Familiarity-Aware Evidence Compression for Retrieval-Augmented Generation (Findings)

Huge thanks to my collaborators🙌
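The first title suggests using code execution as a supervision signal. A common pattern in that space (sketched here as an assumption about the general technique, not as this paper's actual method) is to keep only model-generated programs that pass an executable check, and use the survivors as training data. Running untrusted generated code should be sandboxed in practice.

```python
import os
import subprocess
import sys
import tempfile

def passes_execution_check(candidate_code, test_snippet, timeout=5):
    """Execute a candidate program together with a test snippet in a
    subprocess; return True only if it exits cleanly (all asserts pass)."""
    program = candidate_code + "\n" + test_snippet
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return proc.returncode == 0
    finally:
        os.unlink(path)

# Two hypothetical model samples for the same task, and a grounding test.
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
test = "assert add(2, 3) == 5\n"
```

Filtering on `passes_execution_check` turns the interpreter into a cheap, objective labeler: no human annotation, just pass/fail grounding.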

jakedineenasu (@jakedineenasu)'s Twitter Profile Photo

Thrilled to share QA-LIGN at #EMNLP2025! Bridging rule-based rewards and LLM-as-a-Judge via LLM-derived symbolic reward rubrics. 🔗 arxiv.org/pdf/2506.08123

Ben Zhou (@benzhou96)'s Twitter Profile Photo

In addition, we show that a rubric-based reward can help steer models' thinking/reflection process, which is potentially key to future work on infusing constraints into reasoning.
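A rubric-based reward of the kind described above can be pictured as a weighted checklist over a response. The sketch below is a toy: the rubric items and string-predicate checks are hypothetical, and in a real system each check could itself be an LLM-as-judge call rather than a hand-written predicate.

```python
def rubric_reward(response, rubric):
    """Score a response as the weight-fraction of rubric items it satisfies.
    Each rubric item is a (name, check, weight) triple, where check is a
    predicate over the response text."""
    total = sum(w for _, _, w in rubric)
    earned = sum(w for _, check, w in rubric if check(response))
    return earned / total if total else 0.0

# Hypothetical rubric targeting reflective-thinking behaviors.
rubric = [
    ("cites a reason", lambda r: "because" in r.lower(), 2.0),
    ("self-reflects", lambda r: "wait" in r.lower() or "let me check" in r.lower(), 1.0),
    ("stays concise", lambda r: len(r.split()) < 80, 1.0),
]

score = rubric_reward("The answer is 42 because... wait, let me check the units.", rubric)
```

Because each item is named and weighted, the same rubric doubles as a natural-language critique (report the failed item names) and as a scalar reward for RL.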