Xiaotian (Max) Han (@xiaotianhan1)'s Twitter Profile
Xiaotian (Max) Han

@xiaotianhan1

Assistant Professor @ Case Western Reserve | CS Ph.D. @TAMU | Ex-Research Intern @Amazon @Meta @Snap | #machine_learning

ID: 1053627222647427072

Link: http://ahxt.github.io/ | Joined: 20-10-2018 12:41:20

178 Tweets

1.1K Followers

1.1K Following

Xiaotian (Max) Han (@xiaotianhan1)'s Twitter Profile Photo

Changed to training mini-r1-zero on the countdown task with Qwen2.5-1.5B. It's disappointing to see the completion length decrease while the accuracy reward improves.🙃

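For reference, a minimal sketch of the kind of rule-based accuracy reward typically used for the countdown task in R1-zero-style runs. This is my own illustration, not the actual mini-r1-zero code; the `<answer>` tag format and the function name are assumptions.

```python
import re

def countdown_accuracy_reward(completion: str, numbers: list[int], target: int) -> float:
    """Reward 1.0 if the equation inside <answer>...</answer> uses exactly the
    given numbers and evaluates to the target, else 0.0. Illustrative only."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    equation = match.group(1).strip()
    # Only digits, arithmetic operators, parentheses, and whitespace are allowed.
    if not re.fullmatch(r"[\d+\-*/() .]+", equation):
        return 0.0
    used = [int(n) for n in re.findall(r"\d+", equation)]
    if sorted(used) != sorted(numbers):
        return 0.0
    try:
        value = eval(equation, {"__builtins__": None}, {})
    except Exception:
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0

# Example: combine [25, 3, 4] to reach 79 -> "25*3+4"
print(countdown_accuracy_reward("<answer>25*3+4</answer>", [25, 3, 4], 79))  # 1.0
```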
Jason Weston (@jaseweston)'s Twitter Profile Photo

🚨 New paper & dataset! 🚨
NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions
- Synthesizes 2.8M challenging and diverse questions which require multi-step reasoning, along with reference answers
- Shows steeper data scaling curve for knowledge distillation
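For anyone who wants to browse the data, a minimal sketch using the Hugging Face `datasets` library. The dataset id `facebook/natural_reasoning` and the column names are my assumptions based on the release, so check the dataset card before relying on them.

```python
from datasets import load_dataset

# Assumed dataset id and fields -- verify against the dataset card.
ds = load_dataset("facebook/natural_reasoning", split="train")
print(len(ds))  # ~2.8M questions, per the announcement

example = ds[0]
print(example["question"])
print(example.get("reference_answer", "<no reference answer field>"))
```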
Kimi.ai (@kimi_moonshot)'s Twitter Profile Photo

🚀 Introducing our new tech report: Muon is Scalable for LLM Training

We found that the Muon optimizer can be scaled up using the following techniques: 
• Adding weight decay
• Carefully adjusting the per-parameter update scale

✨ Highlights:
• ~2x computational efficiency vs AdamW
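For intuition, a compact sketch of a Muon-style update step: Newton-Schulz orthogonalization of the momentum plus the decoupled weight decay mentioned above. This is my own simplification (function names, constants, and the per-shape scaling are illustrative), not Moonshot's released implementation.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D momentum matrix via the commonly used
    quintic Newton-Schulz iteration. Illustrative coefficients."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95, weight_decay=0.01):
    """One Muon-style update for a 2D weight matrix, with decoupled weight decay (sketch)."""
    momentum_buf.mul_(beta).add_(grad)                  # heavy-ball momentum
    update = newton_schulz_orthogonalize(momentum_buf)  # orthogonalized direction
    # Per-parameter update scale adjusted by the matrix shape (sketch of the idea).
    update *= max(1.0, param.shape[0] / param.shape[1]) ** 0.5
    param.mul_(1 - lr * weight_decay)                   # decoupled weight decay
    param.add_(update, alpha=-lr)
```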
Xiaotian (Max) Han (@xiaotianhan1)'s Twitter Profile Photo

Thanks for featuring our work! More information here:
code: github.com/uservan/ThinkPO
paper: arxiv.org/pdf/2502.13173
models (MATH500 91.2%): huggingface.co/VanWang/DeepSe…
datasets: huggingface.co/datasets/VanWa…

Xiaotian (Max) Han (@xiaotianhan1)'s Twitter Profile Photo

📢 [New Research] Introducing Speculative Thinking—boosting small LLMs by leveraging large-model mentorship.

Why?
- Small models generate overly long responses, especially when incorrect.
- Large models offer concise, accurate reasoning patterns.
- Wrong reasoning (thoughts) is
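A rough sketch of how such large-model mentorship could be wired up, as I read the announcement; the `generate` interface, the reflection-token triggers, and the token budgets here are placeholders of mine, not the paper's actual method.

```python
def speculative_thinking(question, small_model, large_model,
                         max_length=4096, mentor_tokens=256):
    """Sketch: the small model drafts reasoning; when it emits a reflection cue,
    the large model contributes a short, more reliable continuation.
    `small_model.generate` / `large_model.generate` are placeholder callables
    that take a prompt and a token budget and return new text."""
    TRIGGERS = ("wait", "hmm", "alternatively")  # illustrative reflection cues
    text = question
    while len(text) < max_length:
        chunk = small_model.generate(text, max_new_tokens=128)
        text += chunk
        if any(t in chunk.lower() for t in TRIGGERS):
            # Hand the partial reasoning to the large model for a short
            # mentorship segment, then return control to the small model.
            text += large_model.generate(text, max_new_tokens=mentor_tokens)
        if "</answer>" in chunk:
            break
    return text
```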
Xiaotian (Max) Han (@xiaotianhan1)'s Twitter Profile Photo

📢New Paper 

"You Only Debias Once" (Oral <a href="/CPALconf/">Conference on Parsimony and Learning (CPAL)</a>):
Train once, adjust fairness flexibly at inference—no costly retraining! 

Instead of targeting a single fairness-optimal point in the weight space, YODO learns a line connecting accuracy and fairness optima. Just select your
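The inference-time knob described above boils down to picking a point on a learned line between two weight endpoints; a minimal sketch of that interpolation (the names and state-dict interface are mine, not the paper's code).

```python
def yodo_weights(w_accuracy: dict, w_fairness: dict, t: float) -> dict:
    """Pick a point on the learned line between the accuracy-optimal endpoint
    (t=0) and the fairness-optimal endpoint (t=1). Values are weight tensors.
    Sketch of the idea only."""
    assert 0.0 <= t <= 1.0
    return {name: (1 - t) * w_accuracy[name] + t * w_fairness[name]
            for name in w_accuracy}

# At deployment: choose the accuracy/fairness trade-off without retraining, e.g.
# model.load_state_dict(yodo_weights(sd_acc, sd_fair, t=0.3))
```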
Xiaotian (Max) Han (@xiaotianhan1)'s Twitter Profile Photo

This triggers me: do rebuttals actually improve paper quality? Not in my view. Often, rebuttals force us to address unrealistic questions about why our methods work or to conduct endless comparison experiments, neither of which genuinely enhances quality. In reality, the

Xiaotian (Max) Han (@xiaotianhan1)'s Twitter Profile Photo

One piece of evidence showing Muon's superiority over AdamW🎉. In 1B LLaMA training, the speed difference is minor, with throughput dropping only slightly from 64,000 to 61,500. Both AdamW and Muon use customized implementations. Keller Jordan

Yu Wang (@__yuwang__)'s Twitter Profile Photo

Introducing The Most Advanced Memory System for LLM Agents MIRIX is by far the most advanced memory system in the world, designed to make AI truly remember, learn, and help you over time. Website: mirix.io Paper: arxiv.org/abs/2507.07957 Github: