Zhaofeng Wu @ ICLR (@zhaofeng_wu)'s Twitter Profile
Zhaofeng Wu @ ICLR

@zhaofeng_wu

PhD student @MIT_CSAIL | Previously @allen_ai | MS'21 BS'19 BA'19 @uwnlp

ID: 3231168386

Link: https://zhaofengwu.github.io · Joined: 31-05-2015 01:30:02

237 Tweets

1.1K Followers

249 Following

MIT NLP (@nlp_mit)'s Twitter Profile Photo

Hello everyone! We are quite a bit late to the Twitter party, but welcome to the MIT NLP Group account! Follow along for the latest research from our labs as we dive deep into language, learning, and logic 🤖📚🧠

Jiacheng Liu (@liujc1998)'s Twitter Profile Photo

Today we're unveiling OLMoTrace, a tool that enables everyone to understand the outputs of LLMs by connecting to their training data. We do this on unprecedented scale and in real time: finding matching text between model outputs and 4 trillion training tokens within seconds. ✨
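The span-matching idea behind this can be illustrated with a toy n-gram index. This is a sketch of mine, not OLMoTrace's implementation: matching 4 trillion tokens in seconds requires suffix-array-style machinery, but the core operation (seed an exact match, extend it, keep maximal spans) looks roughly like this:

```python
from collections import defaultdict

def build_ngram_index(corpus_tokens, n=3):
    """Index every n-gram in the corpus by its start positions
    (a toy stand-in for a suffix-array over the training data)."""
    index = defaultdict(list)
    for i in range(len(corpus_tokens) - n + 1):
        index[tuple(corpus_tokens[i:i + n])].append(i)
    return index

def find_matching_spans(output_tokens, corpus_tokens, index, n=3):
    """Return (output_start, corpus_start, length) for maximal exact
    matches of at least n tokens between a model output and the corpus."""
    spans = []
    for i in range(len(output_tokens) - n + 1):
        for j in index.get(tuple(output_tokens[i:i + n]), []):
            # extend the seed match greedily to the right
            length = n
            while (i + length < len(output_tokens)
                   and j + length < len(corpus_tokens)
                   and output_tokens[i + length] == corpus_tokens[j + length]):
                length += 1
            spans.append((i, j, length))
    # keep only maximal spans (drop matches contained in a longer one)
    spans.sort(key=lambda s: -s[2])
    kept = []
    for s in spans:
        if not any(k[0] <= s[0] and s[0] + s[2] <= k[0] + k[2] for k in kept):
            kept.append(s)
    return kept
```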

Zhaofeng Wu @ ICLR (@zhaofeng_wu)'s Twitter Profile Photo

Come chat with us on Saturday 4/26 at 10am (poster #240) if you're interested! Also, DMs are open -- happy to chat about multilinguality/interpretability/any random stuff during the conference! (though I may respond faster to email/Whova)

Songlin Yang (@songlinyang4)'s Twitter Profile Photo

📢 (1/16) Introducing PaTH 🛣️ — a RoPE-free contextualized position encoding scheme, built for stronger state tracking, better extrapolation, and hardware-efficient training. PaTH outperforms RoPE across short and long language modeling benchmarks arxiv.org/abs/2505.16381

Lifan Yuan (@lifan__yuan)'s Twitter Profile Photo

We always want to scale up RL, yet simply training longer doesn't necessarily push the limits - exploration gets impeded by entropy collapse. We show that the performance ceiling is surprisingly predictable, and the collapse is driven by covariance between logp and advantage.

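The covariance diagnostic mentioned above can be sketched in a few lines. This is a toy proxy of mine, not the paper's exact estimator: under a policy-gradient step, a positive covariance between log-probability and advantage means already-likely actions get reinforced further, which predicts shrinking entropy:

```python
import numpy as np

def entropy_change_proxy(logps, advantages):
    """Population covariance between per-token log-probabilities and
    advantages. A positive value predicts a drop in policy entropy
    (high-advantage, high-probability actions get pushed even higher)."""
    logps = np.asarray(logps, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    return float(np.cov(logps, advantages, bias=True)[0, 1])
```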
Simeng (Sophia) Han (@hansineng)'s Twitter Profile Photo


Zero fluff, maximum insight ✨. 
Let’s see what LLMs are really made of, with 🧠 Brainteasers. 

We’re not grading answers 🔢. We’re grading thinking 💭. 
Brute force? Creative leap? False confession? 🤔

Instead of asking “Did the model get the right answer?”, 
we ask: “Did it
Naman Jain @ ICLR (@stringchaos)'s Twitter Profile Photo


Can SWE-Agents aid in High-Performance Software development? ⚡️🤔

Introducing GSO: A Challenging Code Optimization Benchmark

🔍 Unlike simple bug fixes, this combines algorithmic reasoning with systems programming 

📊 Results: Current agents struggle with <5% success rate!
Billy Xuanming Zhang (@xuanmingzhang07)'s Twitter Profile Photo


😵‍💫 Long-context human-AI planning with LLMs struggles when users have to manually manage all the context in messy chats (e.g. with ChatGPT). 
Meet 💡JumpStarter: task-structured context curation for better, collaborative planning with LLMs on complex tasks. 🧵 (1/n)
Mengzhou Xia (@xiamengzhou)'s Twitter Profile Photo

Surprisingly, we find training only with incorrect traces leads to strong performance 🤯 Even more interesting: it improves model diversity and test-time scaling—while correct traces do the opposite. Check out the 🧵👇

Ximing Lu (@gximing)'s Twitter Profile Photo

What happens when you ✨scale up RL✨? In our new work, Prolonged RL, we significantly scale RL training to >2k steps and >130k problems—and observe exciting, non-saturating gains as we spend more compute 🚀.

Han Guo (@hanguo97)'s Twitter Profile Photo


We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between?

Introducing Log-Linear Attention with:

- Log-linear time training
- Log-time inference (in both time and memory)
- Hardware-efficient Triton kernels
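The "in between" can be pictured with a Fenwick-tree-style prefix partition: a query at position t reads O(log t) block summaries, rather than t cached tokens (full attention) or a single state (linear attention). A toy sketch of the partition, under my reading of the thread:

```python
def fenwick_buckets(t):
    """Partition the prefix [0, t) into power-of-two blocks, Fenwick-tree
    style. A log-linear attention layer would keep one summary state per
    block, so a query at position t touches at most bit_length(t) states."""
    blocks, start = [], 0
    while start < t:
        # largest power of two that fits in the remaining prefix
        size = 1 << ((t - start).bit_length() - 1)
        blocks.append((start, start + size))
        start += size
    return blocks
```

The block count equals the popcount of t, so it never exceeds log2(t) + 1, which is where the log-linear training and log-time inference costs come from.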
Junhong Shen (@junhongshen1)'s Twitter Profile Photo


🔥Unlocking New Paradigm for Test-Time Scaling of Agents!

We introduce Test-Time Interaction (TTI), which scales the number of interaction steps beyond thinking tokens per step.

Our agents learn to act longer➡️richer exploration➡️better success

Paper: arxiv.org/abs/2506.07976
Songlin Yang (@songlinyang4)'s Twitter Profile Photo

Flash Linear Attention (github.com/fla-org/flash-…) will no longer maintain support for the RWKV series (existing code will remain available). Here’s why:

Yijia Shao (@echoshao8899)'s Twitter Profile Photo


🚨 70 million US workers are about to face their biggest workplace transformation due to AI agents. But nobody asks them what they want.

While AI races to automate everything, we took a different approach: auditing what workers want vs. what AI can do across the US workforce.🧵
Jyo Pari (@jyo_pari)'s Twitter Profile Photo


What if an LLM could update its own weights?

Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs.

Self-editing is learned via RL, using the updated model’s downstream performance as reward.
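The loop described in the thread can be sketched abstractly. Everything here is a placeholder of mine (the callables and the toy numeric "model"), not SEAL's actual interface; it only shows the shape of the generate-edit, update-weights, reward-on-downstream cycle:

```python
def seal_step(generate_self_edit, finetune, evaluate, model, new_input):
    """One toy iteration of a self-editing loop."""
    # 1. the model writes its own training data for the new input
    self_edit = generate_self_edit(model, new_input)
    # 2. apply a weight update using that self-generated data
    updated = finetune(model, self_edit)
    # 3. downstream performance of the updated model is the RL reward
    #    used to train the self-edit generator
    reward = evaluate(updated)
    return updated, self_edit, reward
```

As a usage example with a scalar stand-in for the weights: `seal_step(lambda m, x: x - m, lambda m, e: m + 0.5 * e, lambda m: -abs(3.0 - m), 0.0, 3.0)` moves the "model" halfway toward the target and scores it by closeness.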
Kaiser Sun (@kaiserwholearns)'s Twitter Profile Photo


What happens when an LLM is asked to use information that contradicts its knowledge? We explore knowledge conflict in a new preprint📑
TLDR: Performance drops, and this could affect the overall performance of LLMs in model-based evaluation.📑🧵⬇️ 1/8
#NLProc #LLM #AIResearch