Seungone Kim @ NAACL2025 (@seungonekim) 's Twitter Profile
Seungone Kim @ NAACL2025

@seungonekim

Ph.D. student @LTIatCMU and incoming intern at @AIatMeta working on (V)LM Evaluation & Systems that Improve with (Human) Feedback | Prev: @kaist_ai @yonsei_u

Website: https://seungonekim.github.io/ · Joined: 01-11-2021 14:26:25

733 Tweets

1.1K Followers

888 Following

Rohan Paul (@rohanpaul_ai) 's Twitter Profile Photo

🎣 Unlocks how AI "thinks" by automatically mapping its reasoning steps.

Our understanding of the reasoning strategies LLMs actually use when they "think step-by-step" with Chain-of-Thought (CoT) has been quite limited.

The CoT Encyclopedia:

This paper is a guide for the
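The tweet's promise of "automatically mapping reasoning steps" can be made concrete with a rough sketch: collect CoT traces, embed them, and cluster them into candidate strategies. This is a minimal illustration of the general idea only, not the CoT Encyclopedia's actual pipeline; the example traces and the TF-IDF + k-means choices are assumptions.

```python
# Hedged sketch: group CoT traces into candidate "reasoning strategies" by
# clustering. NOT the CoT Encyclopedia's real pipeline; purely illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical CoT traces collected from a model.
cot_traces = [
    "First, break the problem into subgoals, then solve each in order.",
    "Assume the answer and work backwards to verify each step.",
    "Decompose into smaller subproblems and combine the results.",
    "Start from the conclusion and check consistency backwards.",
]

# Embed the traces (TF-IDF for simplicity; a sentence encoder would be better).
X = TfidfVectorizer().fit_transform(cot_traces)

# Cluster into candidate strategies; a human (or LLM) would then label clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for trace, label in zip(cot_traces, labels):
    print(f"strategy {label}: {trace}")
```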
Joel Jang (@jang_yoel) 's Twitter Profile Photo

Introducing 𝐃𝐫𝐞𝐚𝐦𝐆𝐞𝐧! We got humanoid robots to perform totally new 𝑣𝑒𝑟𝑏𝑠 in new environments through video world models. We believe video world models will solve the data problem in robotics. Bringing the paradigm of scaling human hours to GPU hours. Quick 🧵

Dongkeun Yoon (@dongkeun_yoon) 's Twitter Profile Photo

🙁 LLMs are overconfident even when they are dead wrong.

🧐 What about reasoning models? Can they actually tell us “My answer is only 60% likely to be correct”?

❗Our paper suggests that they can! Through extensive analysis, we investigate what enables this emergent ability.
Seungone Kim @ NAACL2025 (@seungonekim) 's Twitter Profile Photo

Turns out that reasoning models not only excel at solving problems but are also excellent confidence estimators - an unexpected side effect of long CoTs! This reminds me that smart ppl are good at determining what they know & don't know👀 Check out Dongkeun Yoon 's post!
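For readers who want to check verbalized confidences like "60% likely" themselves, below is a minimal sketch of a calibration test using expected calibration error (ECE). The confidence/correctness pairs are invented for illustration, and the equal-width binning is one common choice, not necessarily the paper's protocol.

```python
# Hedged sketch: is a model's stated confidence calibrated? Compute a simple
# expected calibration error (ECE). All data below is made up.
import numpy as np

confidences = np.array([0.9, 0.6, 0.8, 0.55, 0.95, 0.4])  # model-stated probabilities
correct     = np.array([1,   1,   0,   0,    1,    0])     # 1 if the answer was right

def ece(conf, corr, n_bins=5):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total, err = len(conf), 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # Gap between average stated confidence and actual accuracy in the bin,
            # weighted by how many samples fall in the bin.
            err += mask.sum() / total * abs(conf[mask].mean() - corr[mask].mean())
    return err

print(f"ECE: {ece(confidences, correct):.3f}")  # lower = better calibrated
```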

Hyungjoo Chae (@hyungjoochae) 's Twitter Profile Photo

🚀 Introducing Web-Shepherd: the first Process Reward Model (PRM) that guides web agents. 🌐 Current web browsing agents look cool, but they're not fully reliable! 😬 They excel at simple tasks but struggle with complex ones. ❓ Can inference-time scaling help? Previous methods

Seungone Kim @ NAACL2025 (@seungonekim) 's Twitter Profile Photo

We introduce Web Shepherd, the first PRM specialized for web navigation🌎 Prior works have used LLM-as-a-Judge to assess trajectories (RL) or each step (test-time algo.). Yet, this is not suitable in real-world scenarios since it takes too much time! Web Shepherd not only
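As a rough picture of how a PRM can guide a web agent at inference time, here is a best-of-N sketch: sample several candidate actions and keep the one the PRM scores highest. `propose_actions` and `prm_score` are hypothetical stand-ins, not Web-Shepherd's real API.

```python
# Hedged sketch of PRM-guided inference-time scaling for a web agent:
# sample N candidate actions, re-rank them with a PRM, act on the best one.
# Both functions below are hypothetical stand-ins.
import random

def propose_actions(observation: str, n: int) -> list[str]:
    # Stand-in for sampling N candidate actions from a policy LLM.
    return [f"click(element_{random.randint(0, 9)})" for _ in range(n)]

def prm_score(observation: str, action: str) -> float:
    # Stand-in for the PRM judging how promising this single step is.
    return random.random()

def select_action(observation: str, n: int = 8) -> str:
    candidates = propose_actions(observation, n)
    # Best-of-N: a cheap scoring model per step, instead of a slow
    # LLM-as-a-Judge call, is what makes this usable in real time.
    return max(candidates, key=lambda a: prm_score(observation, a))

print(select_action("checkout page with a visible 'Place order' button"))
```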

Shayne Longpre (@shayneredford) 's Twitter Profile Photo

🚨 Lucie-Aimée Kaffee and I are looking for a junior collaborator to research the Open Model Ecosystem! 🤖 Ideally, someone w/ AI/ML background, who can help w/ annotation pipeline + analysis. docs.google.com/forms/d/e/1FAI…

Chaeeun Kim (@chaechaek1214) 's Twitter Profile Photo

❓What if your RAG didn’t need a separate retrieval model at all?

We present 🧊FREESON, a new framework for retriever-FREE retrieval-augmented reasoning.

With FREESON, a single LRM acts as both generator and retriever, shifting the focus from seq2seq matching to locating
Seungone Kim @ NAACL2025 (@seungonekim) 's Twitter Profile Photo

Within the RAG pipeline, the retriever often acts as the bottleneck! Instead of training a better embedding model, we explore using a reasoning model as both the retriever & generator. To do this, we add MCTS to the generative retrieval pipeline. Check out Chaeeun Kim's post!
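To give a flavor of what "adding MCTS to the generative retrieval pipeline" can look like, here is a toy sketch: treat partial document identifiers as tree nodes and use UCB to balance exploring new continuations against exploiting ones that led to useful evidence. Everything here (bit-string identifiers, the random rollout reward) is an assumption for illustration, not FREESON's implementation.

```python
# Hedged sketch: MCTS over generative retrieval, where a node is a partial
# document identifier. Scoring is a random stand-in, not FREESON's method.
import math, random

class Node:
    def __init__(self, prefix: str):
        self.prefix, self.children = prefix, []
        self.visits, self.value = 0, 0.0

def ucb(parent: Node, child: Node, c: float = 1.4) -> float:
    # Unvisited children get priority; otherwise balance mean value vs. novelty.
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def rollout_reward(prefix: str) -> float:
    # Stand-in: a real system would ask the reasoning model whether the
    # passage reached via this identifier prefix supports the answer.
    return random.random()

def mcts(root: Node, expand, iters: int = 50) -> str:
    for _ in range(iters):
        node, path = root, [root]
        # Selection: descend by UCB until reaching a leaf.
        while node.children:
            node = max(node.children, key=lambda ch: ucb(node, ch))
            path.append(node)
        # Expansion: candidate continuations of the document identifier.
        node.children = [Node(node.prefix + t) for t in expand(node.prefix)]
        if node.children:
            node = random.choice(node.children)
            path.append(node)
        # Simulation + backpropagation along the visited path.
        r = rollout_reward(node.prefix)
        for n in path:
            n.visits += 1
            n.value += r
    return max(root.children, key=lambda ch: ch.visits).prefix

# Toy identifier space: 4-bit document IDs.
best = mcts(Node(""), lambda p: ["0", "1"] if len(p) < 4 else [])
print("most-visited identifier prefix:", best)
```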

Jiayi Geng (@jiayiigeng) 's Twitter Profile Photo

Using LLMs to build AI scientists is all the rage now (e.g., Google’s AI co-scientist [1] and Sakana’s Fully Automated Scientist [2]), but how much do we understand about their core scientific abilities?
We know how LLMs can be vastly useful (solving complex math problems) yet
Hyeonbin Hwang (@ronalhwang) 's Twitter Profile Photo

🚨 New Paper co-led with byeongguk jeon 🚨

Q. Can we adapt Language Models, trained to predict the next token, to reason at the sentence level?

I think LMs operating at a higher level of abstraction would be a promising path towards advancing their reasoning, and I am excited to share our
Yizhong Wang (@yizhongwyz) 's Twitter Profile Photo

Thrilled to announce that I will be joining UT Austin Computer Science as an assistant professor in fall 2026!

I will continue working on language models, data challenges, learning paradigms, & AI for innovation. Looking forward to teaming up with new students & colleagues! 🤠🤘
Sean Welleck (@wellecks) 's Twitter Profile Photo

New paper by Andre He:

Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening

arxiv.org/abs/2506.02355

Tired of sharpening the distribution? Try unlikeliness reward to learn new things from the roads less traveled
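A minimal sketch of how an "unlikeliness reward" might combine with GRPO's group-normalized advantages: up-weight correct rollouts that the current policy assigned low probability, so rare solutions get reinforced instead of averaged away. The numbers and the specific bonus form are assumptions, not necessarily the paper's exact formulation (see arxiv.org/abs/2506.02355 for that).

```python
# Hedged sketch: GRPO-style group advantages plus an "unlikeliness" bonus
# that up-weights correct-but-improbable rollouts. Illustrative only.
import numpy as np

# Hypothetical group of G rollouts for one prompt.
task_reward = np.array([1.0, 0.0, 1.0, 0.0])       # e.g., correctness
logprob     = np.array([-5.0, -3.0, -40.0, -8.0])  # policy log-prob of each rollout

beta = 0.01  # strength of the unlikeliness bonus (assumed hyperparameter)
# Bonus grows as the log-prob drops, applied only to correct rollouts.
reward = task_reward + beta * (-logprob) * task_reward

# GRPO: normalize rewards within the group to get advantages.
adv = (reward - reward.mean()) / (reward.std() + 1e-8)
print(adv)  # the unlikely correct rollout (index 2) gets the largest advantage
```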
Apurva Gandhi (@apurvasgandhi) 's Twitter Profile Photo

New preprint on web agents🚨 Go-Browse: Training Web Agents with Structured Exploration Problem: LLMs lack prior understanding of the websites that web agents will be deployed on. Solution: Go-Browse is an unsupervised method for automatically collecting diverse and realistic

Bo Liu (Benjamin Liu) (@benjamin_eecs) 's Twitter Profile Photo

We've always been excited about self-play unlocking continuously improving agents. Our insight: RL selects generalizable CoT patterns from pretrained LLMs. Games provide perfect testing grounds with cheap, verifiable rewards. Self-play automatically discovers and reinforces
Xiang Yue@ICLR2025🇸🇬 (@xiangyue96) 's Twitter Profile Photo

People are racing to push math reasoning performance in #LLMs—but have we really asked why? The common assumption is that improving math reasoning should transfer to broader capabilities in other domains. But is that actually true?

In our study (arxiv.org/pdf/2507.00432), we
Sukjun (June) Hwang (@sukjun_hwang) 's Twitter Profile Photo

Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
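As a toy illustration of the dynamic chunking idea (not H-Net's actual module, which is learned end-to-end inside the network), one can place chunk boundaries where adjacent byte representations disagree. The random embeddings and the cosine-similarity threshold below are stand-ins.

```python
# Hedged sketch of dynamic chunking: cut a byte stream into chunks where
# neighboring byte representations diverge, instead of using a fixed
# tokenizer. Toy version only; H-Net learns this routing end-to-end.
import numpy as np

rng = np.random.default_rng(0)
text = b"tokenization is the final barrier"
# Stand-in embeddings: a real model would produce contextual byte vectors.
emb = rng.normal(size=(len(text), 16))

def boundaries(emb, thresh=0.0):
    # Cosine similarity between neighbors; low similarity => start a new chunk.
    a, b = emb[:-1], emb[1:]
    sim = (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
    return [0] + [i + 1 for i, s in enumerate(sim) if s < thresh]

cuts = boundaries(emb)
chunks = [text[i:j] for i, j in zip(cuts, cuts[1:] + [len(text)])]
print(chunks)  # dynamically discovered byte chunks
```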