Shuibai Zhang (@shuibaiz69721)'s Twitter Profile
Shuibai Zhang

@shuibaiz69721

CS PhD student at UW-Madison

ID: 1645974868880674816

Link: https://zhangshuibai.github.io | Joined: 12-04-2023 02:19:36

62 Tweets

18 Followers

323 Following

Shuibai Zhang (@shuibaiz69721)'s Twitter Profile Photo

Excited to share Open-dCoder (0.5B) — the first fully open diffusion LLM for code! 🚀 The 0.5B diffusion coding model is surprisingly fun to play with! 🔥 Check out the blog for more detail👉 bit.ly/oDLLM-blog

Jason Weston (@jaseweston)'s Twitter Profile Photo

🌀New Test-time scaling method 🌀
📝: arxiv.org/abs/2509.06870
- Use RL to train an LLM solution aggregator
– Reasons, reviews, reconciles, and synthesizes a final solution
-> Much better than existing techniques!
- Simple new method. Strong results across 4 math benchmarks.
🧵1/5
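
To make the aggregation idea concrete, here is a minimal sketch of what such a test-time loop might look like: sample several candidate solutions, then have an aggregator model review and synthesize a final answer. The `generate` helper and prompt wording are assumptions for illustration, not the paper's API.

```python
# Hypothetical sketch of test-time aggregation: sample several candidate
# solutions, then have a trained aggregator model reason over them and
# synthesize one final answer. `generate` is an assumed helper, not the
# paper's interface.
from typing import Callable, List

def aggregate_solutions(
    problem: str,
    generate: Callable[[str, float], str],  # (prompt, temperature) -> text
    num_candidates: int = 8,
) -> str:
    # 1) Sample diverse candidate solutions from the base policy.
    candidates: List[str] = [
        generate(f"Solve step by step:\n{problem}", 1.0)
        for _ in range(num_candidates)
    ]

    # 2) Ask the aggregator (trained with RL in the paper) to review,
    #    reconcile, and synthesize a final solution rather than just voting.
    numbered = "\n\n".join(f"Candidate {i+1}:\n{c}" for i, c in enumerate(candidates))
    aggregator_prompt = (
        f"Problem:\n{problem}\n\n{numbered}\n\n"
        "Review the candidates, point out errors, reconcile disagreements, "
        "and write one final, correct solution."
    )
    return generate(aggregator_prompt, 0.0)  # greedy decode for the final answer
```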
Dulhan Jayalath (@dulhanjay)'s Twitter Profile Photo

🚨New Meta Superintelligence Labs Paper🚨

What do we do when we don’t have reference answers for RL? What if annotations are too expensive or unknown? Compute as Teacher (CaT🐈) turns inference compute into a post-training supervision signal. CaT improves up to 30% even on
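
One way to read the idea, as a rough sketch: spend inference compute on many rollouts, synthesize a reference from them, and use agreement with that reference as the post-training signal. The helpers and the exact synthesis/reward steps below are assumptions, not the paper's recipe.

```python
# Hypothetical Compute-as-Teacher style loop: with no gold answers, turn
# inference compute into a supervision signal. `generate` is an assumed helper.
from typing import Callable, List

def cat_step(prompt: str,
             generate: Callable[[str], str],
             num_rollouts: int = 16):
    # 1) Spend inference compute: sample many rollouts for the same prompt.
    rollouts: List[str] = [generate(prompt) for _ in range(num_rollouts)]

    # 2) Synthesize a single "teacher" reference from the rollouts,
    #    e.g. by asking the model itself to reconcile them.
    synthesis_prompt = (
        prompt + "\n\nHere are several attempts:\n"
        + "\n---\n".join(rollouts)
        + "\n\nCombine them into the single best answer."
    )
    reference = generate(synthesis_prompt)

    # 3) Use agreement with the synthesized reference as a reward /
    #    supervision signal for post-training (details vary by method).
    rewards = [float(r.strip() == reference.strip()) for r in rollouts]
    return reference, rewards
```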
Thinking Machines (@thinkymachines)'s Twitter Profile Photo

LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.
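
For context, here is a minimal, generic LoRA adapter layer (not the post's code): only the low-rank A/B matrices are trained while the pretrained weight stays frozen, which is what makes LoRA so much cheaper than full fine-tuning.

```python
# Minimal LoRA linear layer (generic sketch, not the post's code).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weight (and bias)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + (B A) x * scaling; merging B A into W recovers a plain Linear.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```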
Jinjie Ni @ ICLR'25 🇸🇬 (@nijinjie)'s Twitter Profile Photo

🍷Imagine you are the boss of Google DeepMind.

To train the best diffusion language model in the world within 1 year, using 800 TPU pods, which model size will you go for?

🐿️ We built Quokka to help you decide – the first-ever large-scale scaling law for DLMs.

Interesting facts:

1.
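
As a rough way to think about the question the tweet poses (which model size for a fixed compute budget), here is a generic Chinchilla-style sizing sketch. The coefficients below are placeholders, not Quokka's fitted values, and the C ≈ 6·N·D cost approximation is an assumption.

```python
# Generic compute-optimal sizing sketch (placeholder coefficients, NOT the
# fitted Quokka law). Given a FLOP budget C and a fitted loss surface
# L(N, D) = E + A/N**alpha + B/D**beta with C ≈ 6*N*D, pick the model size N
# that minimizes predicted loss.

E, A, B, alpha, beta = 1.7, 400.0, 410.0, 0.34, 0.28  # placeholder fit

def predicted_loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

def best_model_size(flop_budget: float) -> float:
    # Sweep model sizes; tokens are whatever the budget leaves (C ≈ 6*N*D).
    candidates = [10 ** (8 + 0.1 * i) for i in range(40)]  # 1e8 .. ~1e12 params
    return min(candidates, key=lambda n: predicted_loss(n, flop_budget / (6 * n)))

print(f"{best_model_size(1e24):.3e} params")  # example: 1e24 FLOP budget
```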
Shuibai Zhang (@shuibaiz69721)'s Twitter Profile Photo

Uniform masking during training ≠ how we actually decode at inference. 🚨 That mismatch hurts dLLMs. Solution? Make training decoding-aware — so the model learns the way it generates. (Paper has all the details 📄; congrats to Fred on the great work!)
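
A toy illustration of the mismatch: standard masked-diffusion training masks positions uniformly at random, while inference unmasks tokens in a particular (e.g., confidence-driven) order, so a decoding-aware variant draws training masks from the states the sampler will actually visit. The details below are illustrative assumptions, not the paper's recipe.

```python
# Toy illustration (not the paper's recipe): uniform masking vs. masks drawn
# from the same order the decoder will actually follow at inference.
import torch

def uniform_mask(batch: torch.Tensor, mask_id: int) -> torch.Tensor:
    # Standard dLLM training: each position masked independently with a
    # random ratio per sequence.
    ratio = torch.rand(batch.size(0), 1)
    mask = torch.rand_like(batch, dtype=torch.float) < ratio
    return torch.where(mask, torch.full_like(batch, mask_id), batch)

def decoding_aware_mask(batch: torch.Tensor, mask_id: int,
                        confidence: torch.Tensor) -> torch.Tensor:
    # Decoding-aware variant: mask the positions the sampler would still have
    # masked at a random intermediate step, i.e. the LOW-confidence positions,
    # so training states match inference states.
    step_ratio = torch.rand(batch.size(0), 1)
    k = (step_ratio * batch.size(1)).long().clamp(min=1)
    masked = batch.clone()
    order = confidence.argsort(dim=1)  # least confident first
    for i in range(batch.size(0)):
        masked[i, order[i, :k[i]]] = mask_id
    return masked
```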

Nouha Dziri (@nouhadziri)'s Twitter Profile Photo

🚀Ever wondered how to make RL work on impossibly hard tasks where pass@k = 0%? 🤔

In our new work, we share the RL Grokking Recipe: a training recipe that enables LLMs to solve previously unsolvable coding problems! I will be at #CoLM2025 next week so happy to chat about it!
Tanishq Mathew Abraham, Ph.D. (@iscienceluvr)'s Twitter Profile Photo

SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

"we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood."

"SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% in
Shuibai Zhang (@shuibaiz69721)'s Twitter Profile Photo

Parallel decoding is a key challenge for DLLMs and all types of non-autoregressive models. We explored this in our work—better decoding strategies are needed, as confidence-based ones can be misleading (sometimes worse than random top-k).
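
For readers unfamiliar with the two strategies being compared, here is a small sketch of one parallel-decoding step under each rule: commit the k most confident masked positions vs. k random masked positions. The model interface is assumed; the tweet's point is that the confidence-first rule is not always the better one.

```python
# Sketch of two parallel-decoding unmasking rules for a masked dLLM.
import torch

def decode_step(logits: torch.Tensor, tokens: torch.Tensor, mask_id: int,
                k: int, rule: str = "confidence") -> torch.Tensor:
    probs = logits.softmax(-1)                    # (L, V)
    conf, pred = probs.max(-1)                    # per-position confidence & argmax
    masked = (tokens == mask_id).nonzero(as_tuple=True)[0]
    k = min(k, masked.numel())
    if rule == "confidence":
        chosen = masked[conf[masked].topk(k).indices]      # most confident first
    else:  # "random": the baseline the tweet says can be surprisingly strong
        chosen = masked[torch.randperm(masked.numel())[:k]]
    out = tokens.clone()
    out[chosen] = pred[chosen]                    # commit k positions in parallel
    return out
```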

Albert Ge (@albert_ge_95)'s Twitter Profile Photo

new state of the art UW School of Computer, Data & Information Sciences building fosters state of the art discussions 😃

excited to kickstart our new ml reading seminar! today we had Nicholas E. Corrado give a talk on his latest work on data mixing for llm alignment!

our reading seminar: sites.google.com/view/madml
Nathan Barry (@nathanbarrydev)'s Twitter Profile Photo

BERT is just a Single Text Diffusion Step! (1/n) When I first read about language diffusion models, I was surprised to find that their training objective was just a generalization of masked language modeling (MLM), something we’ve been doing since BERT from 2018. The first
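
The point being made is that BERT's MLM objective is the masked-diffusion objective at a single fixed mask ratio (~15%), whereas a diffusion LM samples the ratio afresh each step. A toy sketch of the two masking schemes, with illustrative helper names:

```python
# Toy sketch: BERT-style MLM masks a fixed ~15% of tokens; a masked-diffusion
# LM samples the mask ratio t ~ U(0, 1] per example, so MLM is the
# single-timestep special case of the diffusion objective.
import torch

def bert_mlm_mask(tokens: torch.Tensor, mask_id: int) -> torch.Tensor:
    mask = torch.rand_like(tokens, dtype=torch.float) < 0.15   # fixed ratio
    return torch.where(mask, torch.full_like(tokens, mask_id), tokens)

def diffusion_mask(tokens: torch.Tensor, mask_id: int) -> torch.Tensor:
    t = torch.rand(())                                          # random ratio per example
    mask = torch.rand_like(tokens, dtype=torch.float) < t
    return torch.where(mask, torch.full_like(tokens, mask_id), tokens)

# In both cases the model predicts the original tokens at masked positions;
# varying (and reweighting over) t yields the diffusion training loss.
```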

Shuibai Zhang (@shuibaiz69721)'s Twitter Profile Photo

This is pretty interesting; it seems LLaDA is well-calibrated. Can you share more details about the experiment setting? What mask ratio are you using when computing the accuracy for each bin?

Rosinality (@rosinality)'s Twitter Profile Photo

An architecture for self-speculative decoding by supporting block diffusion and AR in the same model. I think this kind of approach is quite promising. Anyway, there are inherently sequential problems in generation (especially for agentic trajectories) and parallelizable ones at
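
One way to read "self-speculative" here: the same model drafts a block of tokens in parallel with its diffusion mode, then verifies them with its AR mode, accepting the drafted prefix up to the first disagreement. This is illustrative pseudocode under that assumption, not the referenced architecture.

```python
# Illustrative sketch (not the referenced architecture): one model with a
# block-diffusion "draft" mode and an AR "verify" mode, used self-speculatively.
from typing import Callable, List

def self_speculative_block(prefix: List[int],
                           draft_block: Callable[[List[int], int], List[int]],
                           ar_next: Callable[[List[int]], int],
                           block_size: int = 8) -> List[int]:
    draft = draft_block(prefix, block_size)      # parallel diffusion draft of a block
    accepted: List[int] = []
    for tok in draft:
        expected = ar_next(prefix + accepted)    # AR verification of each drafted token
        if tok != expected:
            accepted.append(expected)            # first disagreement: take AR token, stop
            break
        accepted.append(tok)
    return accepted
```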