Seungone Kim @ NAACL2025 (@seungonekim) 's Twitter Profile
Seungone Kim @ NAACL2025

@seungonekim

Ph.D. student @LTIatCMU and incoming intern at @AIatMeta working on (V)LM Evaluation & Systems that Improve with (Human) Feedback | Prev: @kaist_ai @yonsei_u

Website: https://seungonekim.github.io/ · Joined: 01-11-2021 14:26:25

733 Tweets

1.1K Followers

888 Following

Rohan Paul (@rohanpaul_ai) 's Twitter Profile Photo

🎣 Unlocks how AI "thinks" by automatically mapping its reasoning steps.

Our understanding of the reasoning strategies LLMs actually use when they "think step-by-step" with Chain-of-Thought (CoT) has been quite limited.

The CoT Encyclopedia:

This paper is a guide for the
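The tweet's promise of "automatically mapping reasoning steps" can be made concrete with a rough sketch: collect CoT traces, embed them, and cluster them into candidate strategies. This is a minimal illustration of the general idea only, not the CoT Encyclopedia's actual pipeline; the example traces and the TF-IDF + k-means choices are assumptions.

```python
# Hedged sketch: group CoT traces into candidate "reasoning strategies" by
# clustering. NOT the CoT Encyclopedia's real pipeline; purely illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical CoT traces collected from a model.
cot_traces = [
    "First, break the problem into subgoals, then solve each in order.",
    "Assume the answer and work backwards to verify each step.",
    "Decompose into smaller subproblems and combine the results.",
    "Start from the conclusion and check consistency backwards.",
]

# Embed the traces (TF-IDF for simplicity; a sentence encoder would be better).
X = TfidfVectorizer().fit_transform(cot_traces)

# Cluster into candidate strategies; a human (or LLM) would then label clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for trace, label in zip(cot_traces, labels):
    print(f"strategy {label}: {trace}")
```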
Joel Jang (@jang_yoel) 's Twitter Profile Photo

Introducing 𝐃𝐫𝐞𝐚𝐦𝐆𝐞𝐧! We got humanoid robots to perform totally new 𝑣𝑒𝑟𝑏𝑠 in new environments through video world models. We believe video world models will solve the data problem in robotics. Bringing the paradigm of scaling human hours to GPU hours. Quick 🧵

Dongkeun Yoon (@dongkeun_yoon) 's Twitter Profile Photo

🙁 LLMs are overconfident even when they are dead wrong.

🧐 What about reasoning models? Can they actually tell us “My answer is only 60% likely to be correct”?

❗Our paper suggests that they can! Through extensive analysis, we investigate what enables this emergent ability.
Seungone Kim @ NAACL2025 (@seungonekim) 's Twitter Profile Photo

Turns out that reasoning models not only excel at solving problems but are also excellent confidence estimators - an unexpected side effect of long CoTs! This reminds me that smart ppl are good at determining what they know & don't know👀 Check out Dongkeun Yoon 's post!
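For readers who want to check verbalized confidences like "60% likely" themselves, below is a minimal sketch of a calibration test using expected calibration error (ECE). The confidence/correctness pairs are invented for illustration, and the equal-width binning is one common choice, not necessarily the paper's protocol.

```python
# Hedged sketch: is a model's stated confidence calibrated? Compute a simple
# expected calibration error (ECE). All data below is made up.
import numpy as np

confidences = np.array([0.9, 0.6, 0.8, 0.55, 0.95, 0.4])  # model-stated probabilities
correct     = np.array([1,   1,   0,   0,    1,    0])     # 1 if the answer was right

def ece(conf, corr, n_bins=5):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total, err = len(conf), 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # Gap between average stated confidence and actual accuracy in the bin,
            # weighted by how many samples fall in the bin.
            err += mask.sum() / total * abs(conf[mask].mean() - corr[mask].mean())
    return err

print(f"ECE: {ece(confidences, correct):.3f}")  # lower = better calibrated
```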

Hyungjoo Chae (@hyungjoochae) 's Twitter Profile Photo

🚀 Introducing Web-Shepherd: the first Process Reward Model (PRM) that guides web agents. 🌐 Current web browsing agents look cool, but they're not fully reliable! 😬 They excel at simple tasks but struggle with complex ones. ❓ Can inference-time scaling help? Previous methods

Seungone Kim @ NAACL2025 (@seungonekim) 's Twitter Profile Photo

We introduce Web Shepherd, the first PRM specialized for web navigation🌎 Prior works have used LLM-as-a-Judge to assess trajectories (RL) or each step (test-time algo.). Yet, this is not suitable in real-world scenarios since it takes too much time! Web Shepherd not only
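As a rough picture of how a PRM can guide a web agent at inference time, here is a best-of-N sketch: sample several candidate actions and keep the one the PRM scores highest. `propose_actions` and `prm_score` are hypothetical stand-ins, not Web-Shepherd's real API.

```python
# Hedged sketch of PRM-guided inference-time scaling for a web agent:
# sample N candidate actions, re-rank them with a PRM, act on the best one.
# Both functions below are hypothetical stand-ins.
import random

def propose_actions(observation: str, n: int) -> list[str]:
    # Stand-in for sampling N candidate actions from a policy LLM.
    return [f"click(element_{random.randint(0, 9)})" for _ in range(n)]

def prm_score(observation: str, action: str) -> float:
    # Stand-in for the PRM judging how promising this single step is.
    return random.random()

def select_action(observation: str, n: int = 8) -> str:
    candidates = propose_actions(observation, n)
    # Best-of-N: a cheap scoring model per step, instead of a slow
    # LLM-as-a-Judge call, is what makes this usable in real time.
    return max(candidates, key=lambda a: prm_score(observation, a))

print(select_action("checkout page with a visible 'Place order' button"))
```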

Shayne Longpre (@shayneredford) 's Twitter Profile Photo

🚨 Lucie-Aimée Kaffee and I are looking for a junior collaborator to research the Open Model Ecosystem! 🤖 Ideally, someone w/ AI/ML background, who can help w/ annotation pipeline + analysis. docs.google.com/forms/d/e/1FAI…

Chaeeun Kim (@chaechaek1214) 's Twitter Profile Photo

❓What if your RAG didn’t need a separate retrieval model at all?

We present 🧊FREESON, a new framework for retriever-FREE retrieval-augmented reasoning.

With FREESON, a single LRM acts as both generator and retriever, shifting the focus from seq2seq matching to locating
Seungone Kim @ NAACL2025 (@seungonekim) 's Twitter Profile Photo

Within the RAG pipeline, the retriever often acts as the bottleneck! Instead of training a better embedding model, we explore using a reasoning model as both the retriever & generator. To do this, we add MCTS to the generative retrieval pipeline. Check out Chaeeun Kim's post!
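To give a flavor of what "adding MCTS to the generative retrieval pipeline" can look like, here is a toy sketch: treat partial document identifiers as tree nodes and use UCB to balance exploring new continuations against exploiting ones that led to useful evidence. Everything here (bit-string identifiers, the random rollout reward) is an assumption for illustration, not FREESON's implementation.

```python
# Hedged sketch: MCTS over generative retrieval, where a node is a partial
# document identifier. Scoring is a random stand-in, not FREESON's method.
import math, random

class Node:
    def __init__(self, prefix: str):
        self.prefix, self.children = prefix, []
        self.visits, self.value = 0, 0.0

def ucb(parent: Node, child: Node, c: float = 1.4) -> float:
    # Unvisited children get priority; otherwise balance mean value vs. novelty.
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def rollout_reward(prefix: str) -> float:
    # Stand-in: a real system would ask the reasoning model whether the
    # passage reached via this identifier prefix supports the answer.
    return random.random()

def mcts(root: Node, expand, iters: int = 50) -> str:
    for _ in range(iters):
        node, path = root, [root]
        # Selection: descend by UCB until reaching a leaf.
        while node.children:
            node = max(node.children, key=lambda ch: ucb(node, ch))
            path.append(node)
        # Expansion: candidate continuations of the document identifier.
        node.children = [Node(node.prefix + t) for t in expand(node.prefix)]
        if node.children:
            node = random.choice(node.children)
            path.append(node)
        # Simulation + backpropagation along the visited path.
        r = rollout_reward(node.prefix)
        for n in path:
            n.visits += 1
            n.value += r
    return max(root.children, key=lambda ch: ch.visits).prefix

# Toy identifier space: 4-bit document IDs.
best = mcts(Node(""), lambda p: ["0", "1"] if len(p) < 4 else [])
print("most-visited identifier prefix:", best)
```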

Jiayi Geng (@jiayiigeng) 's Twitter Profile Photo

Using LLMs to build AI scientists is all the rage now (e.g., Google’s AI co-scientist [1] and Sakana’s Fully Automated Scientist [2]), but how much do we understand about their core scientific abilities?
We know how LLMs can be vastly useful (solving complex math problems) yet
Hyeonbin Hwang (@ronalhwang) 's Twitter Profile Photo

🚨 New Paper co-led with byeongguk jeon 🚨

Q. Can we adapt Language Models, trained to predict the next token, to reason at the sentence level?

I think LMs operating at a higher level of abstraction would be a promising path towards advancing their reasoning, and I am excited to share our
Yizhong Wang (@yizhongwyz) 's Twitter Profile Photo

Thrilled to announce that I will be joining UT Austin Computer Science as an assistant professor in fall 2026!

I will continue working on language models, data challenges, learning paradigms, & AI for innovation. Looking forward to teaming up with new students & colleagues! 🤠🤘
Sean Welleck (@wellecks) 's Twitter Profile Photo

New paper by Andre He:

Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening

arxiv.org/abs/2506.02355

Tired of sharpening the distribution? Try unlikeliness reward to learn new things from the roads less traveled
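A minimal sketch of how an "unlikeliness reward" might combine with GRPO's group-normalized advantages: up-weight correct rollouts that the current policy assigned low probability, so rare solutions get reinforced instead of averaged away. The numbers and the specific bonus form are assumptions, not necessarily the paper's exact formulation (see arxiv.org/abs/2506.02355 for that).

```python
# Hedged sketch: GRPO-style group advantages plus an "unlikeliness" bonus
# that up-weights correct-but-improbable rollouts. Illustrative only.
import numpy as np

# Hypothetical group of G rollouts for one prompt.
task_reward = np.array([1.0, 0.0, 1.0, 0.0])       # e.g., correctness
logprob     = np.array([-5.0, -3.0, -40.0, -8.0])  # policy log-prob of each rollout

beta = 0.01  # strength of the unlikeliness bonus (assumed hyperparameter)
# Bonus grows as the log-prob drops, applied only to correct rollouts.
reward = task_reward + beta * (-logprob) * task_reward

# GRPO: normalize rewards within the group to get advantages.
adv = (reward - reward.mean()) / (reward.std() + 1e-8)
print(adv)  # the unlikely correct rollout (index 2) gets the largest advantage
```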
Apurva Gandhi (@apurvasgandhi) 's Twitter Profile Photo

New preprint on web agents🚨 Go-Browse: Training Web Agents with Structured Exploration Problem: LLMs lack prior understanding of the websites that web agents will be deployed on. Solution: Go-Browse is an unsupervised method for automatically collecting diverse and realistic

Bo Liu (Benjamin Liu) (@benjamin_eecs) 's Twitter Profile Photo

We've always been excited about self-play unlocking continuously improving agents. Our insight: RL selects generalizable CoT patterns from pretrained LLMs. Games provide perfect testing grounds with cheap, verifiable rewards. Self-play automatically discovers and reinforces
Xiang Yue@ICLR2025🇸🇬 (@xiangyue96) 's Twitter Profile Photo

People are racing to push math reasoning performance in #LLMs—but have we really asked why? The common assumption is that improving math reasoning should transfer to broader capabilities in other domains. But is that actually true?

In our study (arxiv.org/pdf/2507.00432), we
Sukjun (June) Hwang (@sukjun_hwang) 's Twitter Profile Photo

Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
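As a toy illustration of the dynamic chunking idea (not H-Net's actual module, which is learned end-to-end inside the network), one can place chunk boundaries where adjacent byte representations disagree. The random embeddings and the cosine-similarity threshold below are stand-ins.

```python
# Hedged sketch of dynamic chunking: cut a byte stream into chunks where
# neighboring byte representations diverge, instead of using a fixed
# tokenizer. Toy version only; H-Net learns this routing end-to-end.
import numpy as np

rng = np.random.default_rng(0)
text = b"tokenization is the final barrier"
# Stand-in embeddings: a real model would produce contextual byte vectors.
emb = rng.normal(size=(len(text), 16))

def boundaries(emb, thresh=0.0):
    # Cosine similarity between neighbors; low similarity => start a new chunk.
    a, b = emb[:-1], emb[1:]
    sim = (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
    return [0] + [i + 1 for i, s in enumerate(sim) if s < thresh]

cuts = boundaries(emb)
chunks = [text[i:j] for i, j in zip(cuts, cuts[1:] + [len(text)])]
print(chunks)  # dynamically discovered byte chunks
```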