UW NLP (@uwnlp) Twitter Tweets • TwiCopy

Kabir

📢 New Paper! Tired 😴 of reasoning benchmarks full of math & code? In our work we consider the problem of reasoning for plot holes in stories -- inconsistencies in a storyline that break the internal logic or rules of a story’s world 🌎 W/ Melanie Sclar, and tsvetshop 1/n

Likes: 241 · Replies: 3 · Reposts: 46

Kunal Jha

@kjha02

3 months ago

Our new paper (first one of my PhD!) on cooperative AI reveals a surprising insight: Environment Diversity > Partner Diversity. Agents trained in self-play across many environments learn cooperative norms that transfer to humans on novel tasks. shorturl.at/fqsNN 🧵

Likes: 137 · Replies: 5 · Reposts: 32

Melanie Sclar

@melaniesclar

3 months ago

See our work on procedurally generating challenging reasoning problems on detecting inconsistencies in stories! FlawedFictions is a great example of what I'm most excited about: reliable synthetic data for reasoning in under-explored domains. (I'll be at ICLR to chat, DMs open!)

Likes: 56 · Replies: 3 · Reposts: 9

Avinandan Bose

@avibose22

3 months ago

🧠 Your LLM should model how you think, not reduce you to preassigned traits 📢 Introducing LoRe: a low-rank reward modeling framework for personalized RLHF ❌ Demographic grouping/handcrafted traits ✅ Infers implicit preferences ✅ Few-shot adaptation 📄 arxiv.org/abs/2504.14439

Likes: 110 · Replies: 2 · Reposts: 26

Liwei Jiang

@liweijianglw

3 months ago

Cracking the 𝐦𝐮𝐥𝐭𝐢-𝐭𝐮𝐫𝐧 safety challenge! ⚡️𝐗-𝐓𝐞𝐚𝐦𝐢𝐧𝐠⚡️ is a scalable red-teaming framework revealing diverse multi-turn LM vulnerabilities. Sneak peek: 96.2% attack success on Claude 3.7—despite its single-turn robustness & the largest multi-turn safety dataset!

Likes: 14 · Replies: 0 · Reposts: 2

Ximing Lu

@gximing

3 months ago

With the rise of R1, search seems out of fashion? We prove the opposite! 😎 Introducing Retro-Search 🌈: an MCTS-inspired search algorithm that RETROspectively revises R1’s reasoning traces to synthesize untaken, new reasoning paths that are better 💡, yet shorter in length ⚡️.

Likes: 250 · Replies: 5 · Reposts: 102

Avinandan Bose

@avibose22

3 months ago

Time to stress-test your AI agents — say hello to DoomArena 🔍🤖 A modular framework to red-team AI agents in realistic threat settings. Plug in attacks, swap threat models, and see what breaks. Built for adaptability, designed for chaos. Live now 🔧🕵️‍♂️🔥: github.com/ServiceNow/Doo…

Likes: 10 · Replies: 0 · Reposts: 4

Melanie Sclar

@melaniesclar

3 months ago

Excited to be at #ICLR2025 🤩 I'll be giving an oral presentation for Creativity Index on Fri 25th 11:06, Garnet 212&219 🎙️ I'll also be presenting posters:
📍 ExploreToM, Sat 26th 10:00, Hall 3 + 2B #49
📍 CreativityIndex, Fri 25th 10:30, Hall 3 + 2B #618
Hope to see you there!

Likes: 44 · Replies: 1 · Reposts: 4

Rulin Shao

@rulinshao

3 months ago

Meet ReasonIR-8B✨the first retriever specifically trained for reasoning tasks! Our challenging synthetic training data unlocks SOTA scores on reasoning IR and RAG benchmarks. ReasonIR-8B ranks 1st on BRIGHT and outperforms search engine and retriever baselines on MMLU and GPQA🔥

Likes: 342 · Replies: 5 · Reposts: 62

Wenting Zhao

@wzhao_nlp

2 months ago

Excited to announce our workshop on Visions of Language Modeling at COLM'25! 🔥 We thought that current LM research overly focuses on a narrow set of popular topics (e.g., test-time scaling and LLM agents), and we'd love to bring some entropy back 💪 To do this, we invited a

Likes: 96 · Replies: 4 · Reposts: 14

Peter West

@peterwesttm

2 months ago

Very excited for this unique workshop we're hosting at COLM -- rather than asking for submissions, we have a terrific, diverse set of speakers giving fresh perspectives on the future of LMs. Don't miss it!

Likes: 25 · Replies: 0 · Reposts: 1

Stella Li

@stellalisy

2 months ago

🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵 Blogpost: tinyurl.com/spurious-rewar…

Likes: 1.1K · Replies: 69 · Reposts: 322

Yizhong Wang

@yizhongwyz

2 months ago

Thrilled to announce that I will be joining Computer Science at UT Austin (@UTAustin, @UTCompSci) as an assistant professor in fall 2026! I will continue working on language models, data challenges, learning paradigms, & AI for innovation. Looking forward to teaming up with new students & colleagues! 🤠🤘

Likes: 620 · Replies: 98 · Reposts: 48

Jaehun Jung

@jaehunjung_com

2 months ago

Data curation is crucial for LLM reasoning, but how do we know if our dataset is not overfit to one benchmark and generalizes to unseen distributions? 🤔 𝐃𝐚𝐭𝐚 𝐝𝐢𝐯𝐞𝐫𝐬𝐢𝐭𝐲 is key: when measured correctly, it strongly predicts model generalization in reasoning tasks! 🧵

Likes: 175 · Replies: 4 · Reposts: 32

Sahil Verma

@sahil1v

2 months ago

🚨 New Paper! 🚨 Guard models slow, language-specific, and modality-limited? Meet OmniGuard, which detects harmful prompts across multiple languages & modalities using one approach, with SOTA performance in all 3 modalities while being 120X faster 🚀 arxiv.org/abs/2505.23856

Likes: 73 · Replies: 1 · Reposts: 33

Jihan Yao

@jihan_yao

2 months ago

We introduce MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation ✅ Reliable: 94.3% agreement with human judgment ✅ Comprehensive: 4 modality combinations × 49 tasks × 937 instructions 🔍Results and Takeaways: > GPT-Image-1 from OpenAI

Likes: 29 · Replies: 2 · Reposts: 17

Yike Wang

@yikewang_

2 months ago

LLMs are helpful for scientific research — but will they continuously be helpful? Introducing 🔍ScienceMeter: current knowledge update methods enable 86% preservation of prior scientific knowledge, 72% acquisition of new, and 38%+ projection of future (arxiv.org/abs/2505.24302).

Likes: 236 · Replies: 10 · Reposts: 53

Liwei Jiang

@liweijianglw

a month ago

🛡️ We present 𝐒𝐞𝐥𝐟-𝐑𝐞𝐝𝐓𝐞𝐚𝐦, a 𝐟𝐮𝐥𝐥𝐲 𝐨𝐧𝐥𝐢𝐧𝐞 𝐬𝐞𝐥𝐟-𝐩𝐥𝐚𝐲 𝐦𝐮𝐥𝐭𝐢-𝐚𝐠𝐞𝐧𝐭 𝐫𝐞𝐢𝐧𝐟𝐨𝐫𝐜𝐞𝐦𝐞𝐧𝐭 𝐥𝐞𝐚𝐫𝐧𝐢𝐧𝐠 (𝐌𝐀𝐑𝐋) 𝐚𝐥𝐠𝐨𝐫𝐢𝐭𝐡𝐦 that co-evolves an Attacker and a Defender—both played by the same LM policy—in a continuous training

Likes: 32 · Replies: 0 · Reposts: 4

Joongwon Kim

@danieljwkim

23 days ago

Can we improve Llama 3’s reasoning abilities through post-training only? Introducing ASTRO, our new framework that teaches LLMs to perform in-context search and generate long CoT to solve math problems, via SFT and RL. Work done at @aiatmeta. 📄 Paper: arxiv.org/abs/2507.00417

Likes: 234 · Replies: 5 · Reposts: 46
