UW NLP (@uwnlp)'s Twitter Profile
UW NLP

@uwnlp

The NLP group at the University of Washington.

ID: 3716745856

Joined: 20-09-2015 10:26:25

1.1K Tweets

12.12K Followers

170 Following

Kabir (@kabirahuja004)'s Twitter Profile Photo

📢 New Paper!

Tired 😴 of reasoning benchmarks full of math & code? In our work we consider the problem of reasoning about plot holes in stories -- inconsistencies in a storyline that break the internal logic or rules of a story's world 🌎

W/ Melanie Sclar and tsvetshop

1/n
Kunal Jha (@kjha02)'s Twitter Profile Photo

Our new paper (first one of my PhD!) on cooperative AI reveals a surprising insight: Environment Diversity > Partner Diversity.

Agents trained in self-play across many environments learn cooperative norms that transfer to humans on novel tasks.

shorturl.at/fqsNN 🧵
Melanie Sclar (@melaniesclar)'s Twitter Profile Photo

See our work on procedurally generating challenging reasoning problems for detecting inconsistencies in stories! FlawedFictions is a great example of what I'm most excited about: reliable synthetic data for reasoning in under-explored domains. (I'll be at ICLR to chat, DMs open!)

Avinandan Bose (@avibose22)'s Twitter Profile Photo

🧠 Your LLM should model how you think, not reduce you to preassigned traits
📢 Introducing LoRe: a low-rank reward modeling framework for personalized RLHF
❌ Demographic grouping/handcrafted traits
✅ Infers implicit preferences
✅ Few-shot adaptation
📄 arxiv.org/abs/2504.14439
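The tweet only names the idea, so here is a minimal sketch of what low-rank reward modeling with few-shot adaptation could look like; the architecture, names, and Bradley-Terry loss below are assumptions for illustration, not LoRe's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankRewardModel(nn.Module):
    """Hypothetical design: each user's reward is a user-specific mixture
    of K shared basis reward heads over a fixed text-embedding space."""
    def __init__(self, encoder_dim: int, rank: int = 8):
        super().__init__()
        self.basis_heads = nn.Linear(encoder_dim, rank, bias=False)

    def forward(self, response_emb: torch.Tensor, user_weights: torch.Tensor):
        # response_emb: (batch, encoder_dim); user_weights: (rank,)
        return self.basis_heads(response_emb) @ user_weights  # (batch,)

def adapt_user(model, emb_chosen, emb_rejected, steps=200, lr=0.1):
    """Few-shot personalization: fit only the K mixture weights on a
    handful of preference pairs, keeping the shared basis frozen."""
    for p in model.parameters():
        p.requires_grad_(False)
    w = torch.zeros(model.basis_heads.out_features, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        margin = model(emb_chosen, w) - model(emb_rejected, w)
        loss = -F.logsigmoid(margin).mean()  # Bradley-Terry preference loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```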
Liwei Jiang (@liweijianglw)'s Twitter Profile Photo

Cracking the multi-turn safety challenge! ⚡️X-Teaming⚡️ is a scalable red-teaming framework revealing diverse multi-turn LM vulnerabilities. Sneak peek: 96.2% attack success on Claude 3.7 (despite its single-turn robustness) & the largest multi-turn safety dataset!

Ximing Lu (@gximing)'s Twitter Profile Photo

With the rise of R1, search seems out of fashion? We prove the opposite! 😎

Introducing Retro-Search 🌈: an MCTS-inspired search algorithm that RETROspectively revises R1's reasoning traces to synthesize untaken, new reasoning paths that are better 💡, yet shorter in length ⚡️.
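As a rough illustration of the retrospective-revision idea (an assumed mechanism, not the paper's actual algorithm): branch an alternative continuation at each step of an existing trace, and keep a revision only if it stays correct while getting shorter.

```python
def retro_search(trace, sample_continuation, is_correct):
    """trace: list of reasoning steps ending in a final answer.
    sample_continuation(prefix) samples an untaken path from a prefix;
    is_correct(candidate) verifies the candidate's final answer."""
    best = list(trace)
    for i in range(len(trace)):
        prefix = trace[:i]
        candidate = prefix + sample_continuation(prefix)  # untaken path
        if is_correct(candidate) and len(candidate) < len(best):
            best = candidate  # still correct, and shorter
    return best
```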
Avinandan Bose (@avibose22)'s Twitter Profile Photo

Time to stress-test your AI agents -- say hello to DoomArena 🔍🤖

A modular framework to red-team AI agents in realistic threat settings.
Plug in attacks, swap threat models, and see what breaks.
Built for adaptability, designed for chaos.
Live now 🔧🕵️‍♂️🔥: github.com/ServiceNow/Doo…
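A hedged sketch of what a pluggable attack/threat-model interface could look like; the names below are hypothetical and do not reflect DoomArena's actual API.

```python
from typing import Callable

class PromptInjectionAttack:
    """Hypothetical attack plugin: injects attacker-controlled content
    into what the agent observes."""
    def __init__(self, payload: str):
        self.payload = payload

    def perturb(self, observation: str) -> str:
        return f"{observation}\n[user-uploaded file]: {self.payload}"

def run_red_team_episode(agent_step: Callable[[str], str],
                         observations: list[str],
                         attack: PromptInjectionAttack,
                         attacker_can_act: Callable[[int], bool]):
    """Replay an episode, letting the attack fire only at steps where the
    threat model grants the attacker access (e.g., only tool outputs)."""
    actions = []
    for t, obs in enumerate(observations):
        if attacker_can_act(t):  # swap this to change the threat model
            obs = attack.perturb(obs)
        actions.append(agent_step(obs))
    return actions
```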
Melanie Sclar (@melaniesclar)'s Twitter Profile Photo

Excited to be at #ICLR2025 🤩

I'll be giving an oral presentation for Creativity Index on Fri 25th 11:06, Garnet 212&219 🎙️

I'll also be presenting posters:
📍 ExploreToM, Sat 26th 10:00, Hall 3 + 2B #49
📍 CreativityIndex, Fri 25th 10:30, Hall 3 + 2B #618

Hope to see you there!

Rulin Shao (@rulinshao)'s Twitter Profile Photo

Meet ReasonIR-8B ✨ the first retriever specifically trained for reasoning tasks! Our challenging synthetic training data unlocks SOTA scores on reasoning IR and RAG benchmarks. ReasonIR-8B ranks 1st on BRIGHT and outperforms search engine and retriever baselines on MMLU and GPQA 🔥
Wenting Zhao (@wzhao_nlp)'s Twitter Profile Photo

Excited to announce our workshop on Visions of Language Modeling at COLM'25! 🔥

We thought that current LM research overly focuses on a narrow set of popular topics (e.g., test-time scaling and LLM agents), and we'd love to bring some entropy back 💪 To do this, we invited a
Peter West (@peterwesttm)'s Twitter Profile Photo

Very excited for this unique workshop we're hosting at COLM -- rather than asking for submissions, we have a terrific, diverse set of speakers giving fresh perspectives on the future of LMs. Don't miss it!

Stella Li (@stellalisy)'s Twitter Profile Photo

🤯 We cracked RLVR with... Random Rewards?!
Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work ⁉️ Here's why: 🧵
Blogpost: tinyurl.com/spurious-rewar…
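For concreteness, here is one way the three reward conditions could be implemented in an RLVR loop; this is an illustrative sketch, not the paper's code.

```python
import random

# RLVR normally scores a rollout by verifying its final answer; the
# "spurious" variants replace that verification signal.
def ground_truth_reward(answer: str, gold: str) -> float:
    return 1.0 if answer.strip() == gold.strip() else 0.0

def random_reward(answer: str, gold: str) -> float:
    return float(random.random() < 0.5)  # coin flip, ignores correctness

def incorrect_reward(answer: str, gold: str) -> float:
    return 1.0 - ground_truth_reward(answer, gold)  # rewards wrong answers
```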
Yizhong Wang (@yizhongwyz)'s Twitter Profile Photo

Thrilled to announce that I will be joining Computer Science at UT Austin as an assistant professor in fall 2026!

I will continue working on language models, data challenges, learning paradigms, & AI for innovation. Looking forward to teaming up with new students & colleagues! 🤠🤘
Jaehun Jung (@jaehunjung_com)'s Twitter Profile Photo

Data curation is crucial for LLM reasoning, but how do we know whether our dataset is overfit to one benchmark or generalizes to unseen distributions? 🤔

Data Diversity is key, when measured correctly -- it strongly predicts model generalization in reasoning tasks! 🧵
Sahil Verma (@sahil1v)'s Twitter Profile Photo

🚨 New Paper! 🚨
Guard models slow, language-specific, and modality-limited?

Meet OmniGuard: it detects harmful prompts across multiple languages & modalities with a single approach, achieving SOTA performance in all 3 modalities while being 120X faster 🚀

arxiv.org/abs/2505.23856
Jihan Yao (@jihan_yao)'s Twitter Profile Photo

We introduce MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation

✅ Reliable: 94.3% agreement with human judgment
✅ Comprehensive: 4 modality combinations × 49 tasks × 937 instructions

🔍 Results and Takeaways:

> GPT-Image-1 from OpenAI
Yike Wang (@yikewang_)'s Twitter Profile Photo

LLMs are helpful for scientific research -- but will they continuously be helpful?

Introducing 🔍 ScienceMeter: current knowledge update methods enable 86% preservation of prior scientific knowledge, 72% acquisition of new knowledge, and 38%+ projection of future knowledge (arxiv.org/abs/2505.24302).
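To make the three numbers concrete, here is a toy computation of preservation/acquisition/projection rates; this is an assumed formulation, not ScienceMeter's actual protocol.

```python
# Quiz the updated model on claims from papers published before, at, and
# after the knowledge update, then report per-set accuracy.
def knowledge_rates(correct_prior, correct_new, correct_future):
    """Each argument: list of 0/1 scores on the corresponding paper set."""
    rate = lambda xs: sum(xs) / len(xs)
    return {
        "preservation": rate(correct_prior),   # prior knowledge retained
        "acquisition":  rate(correct_new),     # newly injected knowledge
        "projection":   rate(correct_future),  # generalizes to future work
    }

# Toy inputs reproducing the tweet's headline numbers.
print(knowledge_rates([1]*86 + [0]*14, [1]*72 + [0]*28, [1]*38 + [0]*62))
```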
Liwei Jiang (@liweijianglw)'s Twitter Profile Photo

๐Ÿ›ก๏ธ We present ๐’๐ž๐ฅ๐Ÿ-๐‘๐ž๐๐“๐ž๐š๐ฆ, a ๐Ÿ๐ฎ๐ฅ๐ฅ๐ฒ ๐จ๐ง๐ฅ๐ข๐ง๐ž ๐ฌ๐ž๐ฅ๐Ÿ-๐ฉ๐ฅ๐š๐ฒ ๐ฆ๐ฎ๐ฅ๐ญ๐ข-๐š๐ ๐ž๐ง๐ญ ๐ซ๐ž๐ข๐ง๐Ÿ๐จ๐ซ๐œ๐ž๐ฆ๐ž๐ง๐ญ ๐ฅ๐ž๐š๐ซ๐ง๐ข๐ง๐  (๐Œ๐€๐‘๐‹) ๐š๐ฅ๐ ๐จ๐ซ๐ข๐ญ๐ก๐ฆ that co-evolves an Attacker and a Defenderโ€”both played by the same LM policyโ€”in a continuous training

Joongwon Kim (@danieljwkim)'s Twitter Profile Photo

Can we improve Llama 3's reasoning abilities through post-training only? Introducing ASTRO, our new framework that teaches LLMs to perform in-context search and generate long CoT to solve math problems, via SFT and RL. Work done at @aiatmeta. 📄 Paper: arxiv.org/abs/2507.00417