Ben Zhou (@benzhou96)'s Twitter Profile
Ben Zhou

@benzhou96

Assistant Professor @SCAI_ASU. Also known as Xuanyu Zhou.

ID: 1033928440645382149

Link: http://xuanyu.me · Joined: 27-08-2018 04:05:24

27 Tweets

310 Followers

248 Following

Nan Xu (@xunannancy)'s Twitter Profile Photo

🚨New paper, Cognitive Overload to #jailbreak #LLM, accepted by #NAACL2024 Findings

LLMs are vulnerable to our black-box attacks:
😞multilingual cognitive overload
☹️ veiled expression
😣effect-to-cause reasoning

Existing defense strategies🤯...

arxiv.org/abs/2311.09827
Bangzheng Li (@bangzhengl)'s Twitter Profile Photo

During a long-chained reasoning process...
🤔Are LLMs following correct reasoning paths or relying on semantic shortcuts?
😵‍💫 What contributes to the failure of LLMs?
🛒Will prompting/RAG solve it?

Check out our paper, which is accepted by #NAACL2024 at: vincentleebang.github.io/eureqa.github.…
Xingyu Fu (@xingyufu2)'s Twitter Profile Photo

Can GPT-4V and Gemini-Pro perceive the world the way humans do? 🤔

Can they solve the vision tasks that humans can in the blink of an eye? 😉

tldr; NO, they are far worse than us 💁🏻‍♀️

Introducing BLINK👁 zeyofu.github.io/blink/, a novel benchmark that studies visual perception
Xingyu Fu (@xingyufu2)'s Twitter Profile Photo

Can Text-to-Image models understand common sense? 🤔

Can they generate images that fit everyday common sense? 🤔

tldr; NO, they are far less intelligent than us 💁🏻‍♀️

Introducing Commonsense-T2I 💡 zeyofu.github.io/CommonsenseT2I/, a novel evaluation and benchmark designed to measure
Zhikun Xu (@jerrrykun)'s Twitter Profile Photo

🚀 Exciting news! Our paper, “ToW: Thoughts of Words Improve Reasoning in Large Language Models”, got accepted at NAACL 2025 Main Conference🎉! #NAACL #NAACL2025 💡What if LMs didn’t just predict words—but actually reason about them? That’s where ToW comes in. 🧵👇

Ben Zhou (@benzhou96)'s Twitter Profile Photo

Excited to see my first student publication since I took the faculty position at ASU. I truly believe this work's general direction, which encourages trial-and-error generalization by removing essential information, will make a more practical impact soon.

Zhikun Xu (@jerrrykun)'s Twitter Profile Photo

🚨 LLMs can generate math proofs and solve competition-level math problems, but do they truly understand them? 🚀Introducing CounterMATH, a benchmark for assessing LLMs’ mathematical reasoning via counterexample-based proofs. 📑Paper link: arxiv.org/abs/2502.10454 🧵Read on!⬇️

Yu Feng (@anniefeng6)'s Twitter Profile Photo

#ICLR2025 Oral

LLMs often struggle with reliable and consistent decisions under uncertainty 😵‍💫 — largely because they can't reliably estimate the probability of each choice.

We propose BIRD, a framework that significantly enhances LLM decision making under uncertainty.

BIRD
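BIRD's actual construction is in the linked paper; as a minimal illustration of the underlying problem the thread names (estimating the probability of each choice), here is a naive Monte-Carlo baseline that samples a stochastic model repeatedly and counts outcomes. The `sample_choice` stub and its bias weights are hypothetical stand-ins for a real LLM call, not part of BIRD.

```python
import random
from collections import Counter

def sample_choice(prompt, choices, rng):
    """Stub standing in for one stochastic LLM call; swap in a real API.
    Hypothetical behavior: the model is biased 3:1 toward the first choice."""
    weights = [3] + [1] * (len(choices) - 1)
    return rng.choices(choices, weights=weights, k=1)[0]

def estimate_choice_probabilities(prompt, choices, n_samples=1000, seed=0):
    """Naive Monte-Carlo estimate of P(choice) from repeated samples."""
    rng = random.Random(seed)
    counts = Counter(sample_choice(prompt, choices, rng) for _ in range(n_samples))
    return {c: counts[c] / n_samples for c in choices}

probs = estimate_choice_probabilities("Should I take an umbrella?", ["yes", "no"])
```

Sampling like this is expensive and noisy, which is part of why a structured framework over the raw model is attractive.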
jakedineenasu (@jakedineenasu)'s Twitter Profile Photo

๐Ÿ” Introducing QA-LIGN: A reflective alignment approach using a draftโ†’reflectionโ†’revision pipeline. We create symbolic reward models that serve as both natural language critics & general reward models, bridging rule-based rewards and RLAIF. ๐Ÿ“„ Paper: arxiv.org/pdf/2506.08123

๐Ÿ” Introducing QA-LIGN: A reflective alignment approach using a draftโ†’reflectionโ†’revision pipeline. We create symbolic reward models that serve as both natural language critics & general reward models, bridging rule-based rewards and RLAIF.

๐Ÿ“„ Paper: arxiv.org/pdf/2506.08123
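QA-LIGN's symbolic reward models are more involved (see the paper), but the draft→reflection→revision control flow it names can be sketched in a few lines. `generate` below is a hypothetical stub in place of a real LLM API, with canned responses purely for illustration.

```python
def generate(prompt):
    """Stub LLM call; replace with a real model API.
    Canned, hypothetical outputs keyed on the prompt's instruction word."""
    if "Revise" in prompt:
        return "Revised answer that addresses the critique."
    if "Critique" in prompt:
        return "The draft is too terse; add justification."
    return "Draft answer."

def draft_reflect_revise(question):
    """One pass of a draft -> reflection -> revision pipeline."""
    draft = generate(question)
    critique = generate(f"Critique the following answer to '{question}':\n{draft}")
    revision = generate(
        f"Revise the answer to '{question}' given this critique:\n{critique}\n"
        f"Draft:\n{draft}"
    )
    return {"draft": draft, "critique": critique, "revision": revision}

result = draft_reflect_revise("What is alignment?")
```

The loop could be iterated until the critique raises no further issues; the single pass here just shows the data flow.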
Aswin RRV (@aswinrrv)'s Twitter Profile Photo

๐ŸŽ‰๐„๐ฑ๐œ๐ข๐ญ๐ž๐ ๐ญ๐จ ๐ฌ๐ก๐š๐ซ๐ž ๐ญ๐ก๐š๐ญ ๐จ๐ฎ๐ซ ๐ง๐ž๐ฐ ๐ฉ๐š๐ฉ๐ž๐ซ, "ThinkTuning", ๐ข๐ฌ ๐ง๐จ๐ฐ ๐จ๐ฎ๐ญ! ๐ŸŽ‰ ๐Ÿ”„RL merely draws out behaviors already present in the base models. Sophisticated thinking behaviors like self-reflection, self-correction and other multi-step reasoning

๐ŸŽ‰๐„๐ฑ๐œ๐ข๐ญ๐ž๐ ๐ญ๐จ ๐ฌ๐ก๐š๐ซ๐ž ๐ญ๐ก๐š๐ญ ๐จ๐ฎ๐ซ ๐ง๐ž๐ฐ ๐ฉ๐š๐ฉ๐ž๐ซ, "ThinkTuning", ๐ข๐ฌ ๐ง๐จ๐ฐ ๐จ๐ฎ๐ญ! ๐ŸŽ‰

๐Ÿ”„RL merely draws out behaviors already present in the base models. Sophisticated thinking behaviors like self-reflection, self-correction and other multi-step reasoning
Ben Zhou (@benzhou96)'s Twitter Profile Photo

Now accepted to #EMNLP2025! Check out our work on instilling "thinking behaviors" in "non-thinking models" without any distillation from thinking models!

Dongwon Jung (@dong_w0n)'s Twitter Profile Photo

Excited to share that two of my first-author papers were accepted to #EMNLP2025! ✨📚

1️⃣ Code Execution as Grounded Supervision for LLM Reasoning (Main)
2️⃣ Familiarity-Aware Evidence Compression for Retrieval-Augmented Generation (Findings)

Huge thanks to my collaborators🙌
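The first title suggests using code execution as a supervision signal. A common pattern in that space (sketched here as an assumption about the general technique, not as this paper's actual method) is to keep only model-generated programs that pass an executable check, and use the survivors as training data. Running untrusted generated code should be sandboxed in practice.

```python
import os
import subprocess
import sys
import tempfile

def passes_execution_check(candidate_code, test_snippet, timeout=5):
    """Execute a candidate program together with a test snippet in a
    subprocess; return True only if it exits cleanly (all asserts pass)."""
    program = candidate_code + "\n" + test_snippet
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return proc.returncode == 0
    finally:
        os.unlink(path)

# Two hypothetical model samples for the same task, and a grounding test.
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
test = "assert add(2, 3) == 5\n"
```

Filtering on `passes_execution_check` turns the interpreter into a cheap, objective labeler: no human annotation, just pass/fail grounding.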

jakedineenasu (@jakedineenasu)'s Twitter Profile Photo

Thrilled to share QA-LIGN at #EMNLP2025! Bridging rule-based rewards and LLM-as-a-Judge via LLM-derived symbolic reward rubrics. 🔗 arxiv.org/pdf/2506.08123

Ben Zhou (@benzhou96)'s Twitter Profile Photo

In addition, we show that a rubric-based reward can help steer models' thinking/reflection process, which is potentially key to future work on infusing constraints into reasoning.
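A rubric-based reward of the kind described above can be pictured as a weighted checklist over a response. The sketch below is a toy: the rubric items and string-predicate checks are hypothetical, and in a real system each check could itself be an LLM-as-judge call rather than a hand-written predicate.

```python
def rubric_reward(response, rubric):
    """Score a response as the weight-fraction of rubric items it satisfies.
    Each rubric item is a (name, check, weight) triple, where check is a
    predicate over the response text."""
    total = sum(w for _, _, w in rubric)
    earned = sum(w for _, check, w in rubric if check(response))
    return earned / total if total else 0.0

# Hypothetical rubric targeting reflective-thinking behaviors.
rubric = [
    ("cites a reason", lambda r: "because" in r.lower(), 2.0),
    ("self-reflects", lambda r: "wait" in r.lower() or "let me check" in r.lower(), 1.0),
    ("stays concise", lambda r: len(r.split()) < 80, 1.0),
]

score = rubric_reward("The answer is 42 because... wait, let me check the units.", rubric)
```

Because each item is named and weighted, the same rubric doubles as a natural-language critique (report the failed item names) and as a scalar reward for RL.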