Kaiyue Wen (@wen_kaiyue)'s Twitter Profile
Kaiyue Wen

@wen_kaiyue

A continuous learner

ID: 1672114677659365378

Link: http://wenkaiyue.com · Joined: 23-06-2023 05:29:50

78 Tweets

313 Followers

457 Following

Simon Park (@parksimon0808)

Does all LLM reasoning transfer to VLMs? In the context of Simple-to-Hard generalization, we show: NO! We also give ways to reduce this modality imbalance.

Paper arxiv.org/abs/2501.02669
Code github.com/princeton-pli/…

Abhishek Panigrahi (@Abhishek_034) · Yun (Catherine) Cheng (@chengyun01) · Dingli Yu (@dingli_yu) · Anirudh Goyal (@anirudhg9119) · Sanjeev Arora (@prfsanjeevarora)
Songlin Yang (@songlinyang4)

I've created slides for those curious about the recent rapid progress in linear attention: from linear attention to Lightning-Attention, Mamba2, DeltaNet, and TTT/Titans. Check it out here: sustcsonglin.github.io/assets/pdf/tal…
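
For readers who want a concrete anchor before opening the slides: the methods listed above all build on the same linear-attention state recurrence, sketched below in PyTorch. This is a generic, unnormalized causal form for illustration only; it is not code from the slides, and the function and variable names are made up here.

import torch

def linear_attention(q, k, v):
    # Vanilla (unnormalized) causal linear attention as a recurrence:
    #   S_t = S_{t-1} + v_t k_t^T,   o_t = S_t q_t
    # The variants covered in the slides (Lightning-Attention, Mamba2,
    # DeltaNet, TTT/Titans) roughly differ in how this state update is
    # decayed, gated, or replaced by a learned update rule.
    B, T, d = q.shape
    S = torch.zeros(B, d, d, dtype=q.dtype, device=q.device)      # running state
    outs = []
    for t in range(T):
        S = S + v[:, t].unsqueeze(-1) * k[:, t].unsqueeze(-2)     # outer product v_t k_t^T
        outs.append(torch.einsum('bij,bj->bi', S, q[:, t]))       # o_t = S_t q_t
    return torch.stack(outs, dim=1)                               # (B, T, d)

# Tiny shape check
q = k = v = torch.randn(2, 8, 16)
print(linear_attention(q, k, v).shape)  # torch.Size([2, 8, 16])
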

Qwen (@alibaba_qwen)

🚀 New Approach to Training MoE Models! We’ve made a key change: switching from micro-batches to global-batches for better load balancing. This simple tweak lets experts specialize more effectively, leading to: 
✅ Improved model performance  
✅ Better handling of real-world …
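
A rough sketch of the change being described, in PyTorch. The auxiliary loss below (num_experts · Σ_i f_i · P_i) is the standard load-balancing term used in most MoE codebases; the function names and the way statistics are pooled here are illustrative assumptions, not Qwen's actual implementation.

import torch

def load_balancing_loss(router_probs, expert_mask):
    # router_probs: (tokens, num_experts) softmax outputs of the router
    # expert_mask:  (tokens, num_experts) one-hot mask of the chosen expert(s)
    num_experts = router_probs.shape[-1]
    f = expert_mask.float().mean(dim=0)  # fraction of tokens routed to each expert
    p = router_probs.mean(dim=0)         # mean router probability per expert
    return num_experts * torch.sum(f * p)

def micro_batch_aux_loss(probs_per_micro, masks_per_micro):
    # Old behavior: every micro-batch must be balanced on its own.
    losses = [load_balancing_loss(p, m) for p, m in zip(probs_per_micro, masks_per_micro)]
    return torch.stack(losses).mean()

def global_batch_aux_loss(probs_per_micro, masks_per_micro):
    # Described change: pool routing statistics over the whole global batch
    # (in practice also across data-parallel ranks) before computing the loss,
    # so experts can specialize within micro-batches as long as the global
    # batch stays balanced.
    probs = torch.cat(probs_per_micro, dim=0)
    masks = torch.cat(masks_per_micro, dim=0)
    return load_balancing_loss(probs, masks)
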
Tongtian Zhu (@tongtian_zhu)

Super excited to share our work on data influence cascade in decentralized learning, just accepted by #ICLR2025! 🎉

Data quality is crucial for LM training. But can we quantify the importance of data in a fully decentralized learning system? 🤔

Here’s a surprising insight: the …
Dimitris Papailiopoulos (@dimitrispapail)

Transformers can overcome easy-to-hard and length generalization challenges through recursive self-improvement. 

Paper on arxiv coming on Monday.
Link to a talk I gave on this below 👇

Super excited about this work!
Konstantin Mishchenko (@konstmish)

Learning rate schedulers used to be a big mystery. Now you can just take a guarantee for *convex non-smooth* problems (from arxiv.org/abs/2310.07831), and it gives you *precisely* what you see in training large models.
See this empirical study:
arxiv.org/abs/2501.18965
1/3
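
For context, the schedules the empirical study compares against the convex non-smooth bound are the familiar ones from LLM training, such as linear decay and warmup-stable-decay ("wsd"). Below is a minimal wsd schedule for illustration; the warmup and cooldown fractions are arbitrary placeholders, not values taken from either paper.

def wsd_lr(step, total_steps, base_lr, warmup_frac=0.05, decay_frac=0.2):
    # Warmup-stable-decay: linear warmup, constant plateau, linear cooldown to 0.
    warmup_steps = max(1, int(warmup_frac * total_steps))
    decay_start = int((1.0 - decay_frac) * total_steps)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - decay_start)

# Example: lrs = [wsd_lr(t, 10_000, 3e-4) for t in range(10_000)]
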
Tengyu Ma (@tengyuma)

RL + CoT works great for DeepSeek-R1 & o1, but:

1️⃣ Linear-in-log scaling in train & test-time compute
2️⃣ Likely bounded by difficulty of training problems

Meet STP—a self-play algorithm that conjectures & proves indefinitely, scaling better! 🧠⚡🧵🧵

arxiv.org/abs/2502.00212
Pierfrancesco Beneventano (@pierbeneventano)

Arseniy and I have, I believe, taken a step toward properly characterizing how and when mini-batch SGD training exhibits the Edge of Stability / Break-Even Point (Stanisław Jastrzębski, Jeremy Cohen).
Link: arxiv.org/abs/2412.20553
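
As background, Edge of Stability is usually diagnosed by tracking the sharpness (the largest Hessian eigenvalue of the training loss) and checking whether it rises until it hovers near the stability threshold 2/η. The sketch below estimates that quantity with power iteration on Hessian-vector products; it is a generic diagnostic, not code from the paper, and loss_fn/params are placeholder names.

import torch

def sharpness(loss_fn, params, n_iters=20):
    # Estimate the top Hessian eigenvalue of loss_fn() w.r.t. params
    # via power iteration on Hessian-vector products.
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    v = torch.randn_like(flat_grad)
    v = v / v.norm()
    eig = 0.0
    for _ in range(n_iters):
        hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
        hv = torch.cat([h.reshape(-1) for h in hv]).detach()
        eig = (v @ hv).item()             # Rayleigh quotient estimate (||v|| = 1)
        v = hv / (hv.norm() + 1e-12)
    return eig

# In an Edge-of-Stability run, sharpness(...) climbs during training and then
# oscillates around 2 / learning_rate instead of growing further.
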
Shengguang Wu (@shengguangwu)

❓Do VLMs really pay attention to image inputs?

😮Shockingly, a VLM is most likely to generate the response below about 𝒶 𝒹𝑜𝑔 when given 𝐧𝐨 𝐢𝐦𝐚𝐠𝐞 𝐚𝐭 𝐚𝐥𝐥—and least likely when shown the correct image.

🏆To tackle this 𝐯𝐢𝐬𝐮𝐚𝐥 𝐧𝐞𝐠𝐥𝐞𝐜𝐭, we introduce a …
William Merrill (@lambdaviking)

How does the depth of a transformer affect reasoning capabilities? New preprint by myself and Ashish Sabharwal (@Ashish_S_AI) shows that a little depth goes a long way to increase transformers’ expressive power

We take this as encouraging for further research on looped transformers!🧵
Zhiyuan Zeng (@zhiyuanzeng_)

Is a single accuracy number all we can get from model evals?🤔
🚨Does NOT tell where the model fails
🚨Does NOT tell how to improve it

Introducing EvalTree🌳
🔍identifying LM weaknesses in natural language
🚀weaknesses serve as actionable guidance

(paper&demo 🔗in🧵) [1/n]

Christina Baek (@_christinabaek)

Are current reasoning models optimal for test-time scaling? 🌠
No! Models make the same incorrect guess over and over again.

We show that you can fix this problem w/o any crazy tricks 💫 – just do weight ensembling (WiSE-FT) for big gains on math!

1/N
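
For reference, WiSE-FT style weight ensembling is just a linear interpolation of two checkpoints that share an architecture. A minimal PyTorch sketch is below; which checkpoints to merge and how to choose alpha for the math gains described above is the thread's contribution and is not shown here.

import copy
import torch

def wise_ft(base_model, finetuned_model, alpha=0.5):
    # Interpolate parameters: alpha=0 returns the base weights, alpha=1 the
    # fine-tuned weights. Integer buffers (e.g. BatchNorm counters) are copied
    # from the base model unchanged.
    base_sd = base_model.state_dict()
    ft_sd = finetuned_model.state_dict()
    merged_sd = {}
    for k in base_sd:
        if torch.is_floating_point(base_sd[k]):
            merged_sd[k] = (1.0 - alpha) * base_sd[k] + alpha * ft_sd[k]
        else:
            merged_sd[k] = base_sd[k]
    merged = copy.deepcopy(base_model)
    merged.load_state_dict(merged_sd)
    return merged
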
Yushun Zhang (@ericzhang0410)

New paper alert!  We report that the Hessian of NNs has a very special structure: 
1. it appears to be a "block-diagonal-block-circulant" matrix at initialization;
2. then it quickly evolves into a "near-block-diagonal" matrix along training.

We then theoretically reveal two …
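
A toy way to look at this kind of block structure yourself: compute the full parameter Hessian of a tiny MLP and print the Frobenius norm of each parameter-group block, as sketched below. The network sizes and data are arbitrary placeholders; the paper's claims concern the Hessian of much larger networks, so treat this purely as an inspection recipe.

import torch

torch.manual_seed(0)

# A 2-layer MLP written as explicit ops on one flat parameter vector,
# so autograd can give us the full parameter Hessian.
d_in, d_hid, d_out, n = 4, 5, 3, 64
X = torch.randn(n, d_in)
y = torch.randint(0, d_out, (n,))
sizes = [d_in * d_hid, d_hid, d_hid * d_out, d_out]   # W1, b1, W2, b2

def loss_of(theta):
    W1, b1, W2, b2 = torch.split(theta, sizes)
    h = torch.tanh(X @ W1.view(d_in, d_hid) + b1)
    logits = h @ W2.view(d_hid, d_out) + b2
    return torch.nn.functional.cross_entropy(logits, y)

theta0 = 0.1 * torch.randn(sum(sizes))
H = torch.autograd.functional.hessian(loss_of, theta0)   # (P, P) matrix

# Frobenius norm of each parameter-group block: near-block-diagonal structure
# shows up as large diagonal blocks (W1-W1, b1-b1, ...) and comparatively
# small off-diagonal blocks (W1-W2, ...).
offsets = [0]
for s in sizes:
    offsets.append(offsets[-1] + s)
for i in range(len(sizes)):
    row = [H[offsets[i]:offsets[i+1], offsets[j]:offsets[j+1]].norm().item()
           for j in range(len(sizes))]
    print(["%.3f" % v for v in row])
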
Percy Liang (@percyliang)

What would truly open-source AI look like? Not just open weights, open code/data, but *open development*, where the entire research and development process is public *and* anyone can contribute. We built Marin, an open lab, to fulfill this vision: