Greg Durrett (@gregd_nlp)'s Twitter Profile
Greg Durrett

@gregd_nlp

CS professor at UT Austin. Large language models and NLP. he/him

ID: 938457074278846468

Joined: 06-12-2017 17:16:17

1.1K Tweets

6.6K Followers

797 Following

XLLM-Reason-Plan (@xllmreasonplan)'s Twitter Profile Photo

πŸ“’Announcing 𝐭𝐑𝐞 𝐟𝐒𝐫𝐬𝐭 𝐰𝐨𝐫𝐀𝐬𝐑𝐨𝐩 𝐨𝐧 𝐭𝐑𝐞 𝐀𝐩𝐩π₯𝐒𝐜𝐚𝐭𝐒𝐨𝐧 𝐨𝐟 π‹π‹πŒ 𝐄𝐱𝐩π₯πšπ’π§πšπ›π’π₯𝐒𝐭𝐲 𝐭𝐨 π‘πžπšπ¬π¨π§π’π§π  𝐚𝐧𝐝 𝐏π₯𝐚𝐧𝐧𝐒𝐧𝐠 at Conference on Language Modeling! We welcome perspectives from LLM, XAI, and HCI! CFP (Due June 23): …reasoning-planning-workshop.github.io

πŸ“’Announcing 𝐭𝐑𝐞 𝐟𝐒𝐫𝐬𝐭 𝐰𝐨𝐫𝐀𝐬𝐑𝐨𝐩 𝐨𝐧 𝐭𝐑𝐞 𝐀𝐩𝐩π₯𝐒𝐜𝐚𝐭𝐒𝐨𝐧 𝐨𝐟 π‹π‹πŒ 𝐄𝐱𝐩π₯πšπ’π§πšπ›π’π₯𝐒𝐭𝐲 𝐭𝐨 π‘πžπšπ¬π¨π§π’π§π  𝐚𝐧𝐝 𝐏π₯𝐚𝐧𝐧𝐒𝐧𝐠 at <a href="/COLM_conf/">Conference on Language Modeling</a>! 
We welcome perspectives from LLM, XAI, and HCI!
CFP (Due June 23): …reasoning-planning-workshop.github.io
David Bau (@davidbau)'s Twitter Profile Photo

Dear MAGA friends, I have been worrying about STEM in the US a lot, because right now the Senate is writing new laws that cut 75% of the STEM budget in the US. Sorry for the long post, but the issue is really important, and I want to share what I know about it. The entire

Kanishka Misra 🌊 (@kanishkamisra)'s Twitter Profile Photo

NewsπŸ—žοΈ I will return to UT Austin as an Assistant Professor of Linguistics this fall, and join its vibrant community of Computational Linguists, NLPers, and Cognitive Scientists!🀘 Excited to develop ideas about linguistic and conceptual generalization! Recruitment details soon

NewsπŸ—žοΈ

I will return to UT Austin as an Assistant Professor of Linguistics this fall, and join its vibrant community of Computational Linguists, NLPers, and Cognitive Scientists!🀘

Excited to develop ideas about linguistic and conceptual generalization! Recruitment details soon
Greg Durrett (@gregd_nlp)'s Twitter Profile Photo

Great to work on this benchmark with astronomers in our NSF-Simons CosmicAI institute! What I like about it: (1) focus on data processing & visualization, a "bite-sized" AI4Sci task (not automating all of research) (2) eval with VLM-as-a-judge (possible with strong, modern VLMs)

Vaishnavh Nagarajan (@_vaishnavh)'s Twitter Profile Photo

πŸ“’ New paper on creativity & multi-token prediction! We design minimal open-ended tasks to argue: β†’ LLMs are limited in creativity since they learn to predict the next token β†’ creativity can be improved via multi-token learning & injecting noise ("seed-conditioning" 🌱) 1/ 🧡

πŸ“’ New paper on creativity &amp; multi-token prediction! We design minimal open-ended tasks to argue:

β†’ LLMs are limited in creativity since they learn to predict the next token

β†’ creativity can be improved via multi-token learning &amp; injecting noise ("seed-conditioning" 🌱) 1/ 🧡
Fangcong Yin (@fangcong_y10593)'s Twitter Profile Photo

Solving complex problems with CoT requires combining different skills.

We can do this by:
🧩Modifying the CoT data format to be “composable” with other skills
🔥Training models on each skill
📌Combining those models

This leads to better 0-shot reasoning on tasks involving skill composition!
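One simple (hypothetical) reading of "combining those models" is weight-space merging of the skill-specialized checkpoints; the thread doesn't specify the mechanism, and the paper's actual method may differ. A minimal sketch, with plain dicts standing in for model state dicts:

```python
def merge_state_dicts(dicts, weights=None):
    """Combine skill-specialized models by (weighted) averaging their
    parameters. Here parameters are floats; in practice they'd be tensors."""
    if weights is None:
        # Default: uniform average over all skill models.
        weights = [1.0 / len(dicts)] * len(dicts)
    return {
        key: sum(w * d[key] for w, d in zip(weights, dicts))
        for key in dicts[0]
    }

# Toy stand-ins for two models fine-tuned on different atomic CoT skills.
skill_a = {"layer.w": 0.2, "layer.b": 1.0}
skill_b = {"layer.w": 0.6, "layer.b": 3.0}

merged = merge_state_dicts([skill_a, skill_b])
print(merged)
```

Non-uniform `weights` would let one skill dominate the merge; again, this is only an illustration of one possible combination strategy, not the paper's recipe.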
Greg Durrett (@gregd_nlp)'s Twitter Profile Photo

CoT is effective for in-domain reasoning tasks, but Fangcong's work takes a nice step in improving compositional generalization of CoT reasoning. We teach models that atomic CoT skills fit together like puzzle pieces so they can then combine them in novel ways. Lots to do here!

Asher Zheng (@asher_zheng00)'s Twitter Profile Photo

Language is often strategic, but LLMs tend to play nice. How strategic are they really? Probing into that is key for future safety alignment.πŸ›Ÿ

πŸ‘‰Introducing CoBRA🐍, a framework that assesses strategic language.

Work with my amazing advisors <a href="/jessyjli/">Jessy Li</a> and <a href="/David_Beaver/">David Beaver</a>!
πŸ§΅πŸ‘‡
CosmicAI (@cosmicai_inst)'s Twitter Profile Photo

CosmicAI collab: benchmarking the utility of LLMs in astronomy coding workflows & focusing on the key research capability of scientific visualization. Sebastian Joseph, Jessy Li, Murtaza Husain, Greg Durrett, Dr. Stephanie Juneau, paul.torrey, Adam Bolton, Stella Offner, Juan Frias, Niall Gaffney

Ryan Marten (@ryanmart3n)'s Twitter Profile Photo

Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals.

We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data
Bespoke Labs (@bespokelabsai)'s Twitter Profile Photo

Understanding what’s in the data is a high leverage activity when it comes to training/evaluating models and agents.

This week we will drill down into a few popular benchmarks and share some custom viewers that will help pop up various insights. 

Our viewer for GPQA (Google
Xi Ye (@xiye_nlp)'s Twitter Profile Photo

πŸ€” Recent mech interp work showed that retrieval heads can explain some long-context behavior. But can we use this insight for retrieval? πŸ“£ Introducing QRHeads (query-focused retrieval heads) that enhance retrieval Main contributions: πŸ” Better head detection: we find a

πŸ€” Recent mech interp work showed that retrieval heads can explain some long-context behavior. But can we use this insight for retrieval?
πŸ“£ Introducing QRHeads (query-focused retrieval heads) that enhance retrieval

Main contributions:
 πŸ” Better head detection: we find a
Greg Durrett (@gregd_nlp)'s Twitter Profile Photo

I'm excited about Leo's use of hypernetworks for data-efficient knowledge editing! Tweaking what a model learns from data is very powerful & useful for other goals like alignment. Haven't seen much other work building on MEND recently, but let me know what cool stuff we missed!

Greg Durrett (@gregd_nlp)'s Twitter Profile Photo

If we don't do physical work in our jobs, we go to the gym and work out. What are the gyms for skills that LLMs will automate?

Percy Liang (@percyliang)'s Twitter Profile Photo

Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team: Tatsunori Hashimoto, Marcel Rød, Neil Band, Rohith Kuditipudi. Researchers are becoming detached from the technical details of how LMs work. In CS336, we try to fix that by having students build everything:

Xi Ye (@xiye_nlp)'s Twitter Profile Photo

There’s been hot debate about (The Illusion of) The Illusion of Thinking. My take: it’s not that models can’t reason β€” they just aren’t perfect at long-form generation yet. We eval reasoning models on LongProc benchmark (requiring generating 8K CoTs, see thread). Reasoning

David Hall (@dlwh)'s Twitter Profile Photo

So about a month ago, Percy posted a version of this plot of our Marin 32B pretraining run. We got a lot of feedback, both public and private, that the spikes were bad. (This is a thread about how we fixed the spikes. Bear with me. )

Lily Chen (@lilyychenn)'s Twitter Profile Photo

Are we fact-checking medical claims the right way? πŸ©ΊπŸ€”

Probably not. In our study, even experts struggled to verify Reddit health claims using end-to-end systems.

We show whyβ€”and argue fact-checking should be a dialogue, with patients in the loop

arxiv.org/abs/2506.20876

🧡1/