Zhiyuan Li (@zhiyuanli_)'s Twitter Profile
Zhiyuan Li

@zhiyuanli_

Assistant Professor @TTIC_Connect. Previously Postdoc @Stanford and PhD @PrincetonCS. Deep Learning Theory.

ID: 760226120335781895

Website: http://zhiyuanli.ttic.edu · Joined: 01-08-2016 21:30:06

69 Tweets

1.1K Followers

308 Following

M3L Workshop @ NeurIPS 2024 (@m3lworkshop):

📡 Join us at the 2nd workshop on Mathematics of Modern Machine Learning (M3L) at #NeurIPS2024!

sites.google.com/view/m3l-2024/
Submission deadline: September 29, 2024
Zhiyuan Li (@zhiyuanli_):

Exciting new work led by the amazing Kaiyue Wen on theoretical justification for the recently popular WSD schedule! It is based on an interesting and novel assumption about the training loss called "River Valley", which helps explain hidden progress in large-learning-rate training.
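For readers unfamiliar with the schedule being justified: WSD (warmup-stable-decay) keeps the learning rate at its peak for most of training and only anneals it in a short final phase. The sketch below is an illustrative implementation of such a schedule; the phase fractions, peak learning rate, and linear decay shape are arbitrary choices for illustration, not settings from the paper.

```python
# Illustrative WSD (Warmup-Stable-Decay) learning-rate schedule:
# linear warmup, a long constant ("stable") phase at the peak LR,
# then a short decay toward zero. All hyperparameters are assumptions.
def wsd_lr(step, total_steps, peak_lr=3e-4,
           warmup_frac=0.01, decay_frac=0.1, min_lr=0.0):
    warmup_steps = int(warmup_frac * total_steps)
    decay_steps = int(decay_frac * total_steps)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:                       # warmup: 0 -> peak
        return peak_lr * step / max(1, warmup_steps)
    if step < stable_end:                         # stable: constant peak LR
        return peak_lr
    # decay: linear anneal from peak_lr down to min_lr
    frac = (step - stable_end) / max(1, decay_steps)
    return peak_lr + frac * (min_lr - peak_lr)

total = 10_000
for s in (0, 50, 5_000, 9_500, 9_999):
    print(s, round(wsd_lr(s, total), 6))
```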

M3L Workshop @ NeurIPS 2024 (@m3lworkshop):

Hope everyone had fun at the 2nd workshop of M3L! Many thanks to the speakers, authors, reviewers, and participants for making this workshop a success. We had a full house again, and we hope to see you next year! 💡
Kaifeng Lyu (@vfleaking):

Can we quantify the effect of learning rate schedules? Empirically, what's the best schedule for LLM pretraining? 🚀Excited to share our ICLR paper! arxiv.org/abs/2503.12811 With ≤3 runs, you can fit our empirical law and optimize your schedule—a WSD-like schedule is the best!

David Yin (@davidyin0609):

SVRG is popular in theoretical optimization, but it has not been widely adopted to train large neural networks.

In our ICLR work “A Coefficient Makes SVRG Effective”, we show that adding a coefficient helps SVRG optimize deep neural networks.

arxiv.org/abs/2311.05589
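For context, SVRG keeps a snapshot of the weights, computes the full gradient there, and uses it as a control variate for each stochastic gradient. The sketch below illustrates that update with a scalar coefficient on the variance-reduction term (coefficient = 1 recovers standard SVRG) on a toy least-squares problem; the coefficient value and its schedule are illustrative assumptions, not the paper's recommended settings.

```python
# Sketch of SVRG with a coefficient on the variance-reduction term,
# demonstrated on a toy least-squares problem (alpha = 1 is standard SVRG).
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 20
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=n)

def grad_i(w, i):
    # gradient of 0.5 * (A_i w - b_i)^2 for a single sample i
    return (A[i] @ w - b[i]) * A[i]

def full_grad(w):
    # full-batch gradient (mean over samples)
    return A.T @ (A @ w - b) / n

def svrg_with_coefficient(alpha=0.5, lr=1e-2, epochs=20, inner=n):
    w = np.zeros(d)
    for _ in range(epochs):
        w_snap = w.copy()          # snapshot point
        mu = full_grad(w_snap)     # full gradient at the snapshot
        for _ in range(inner):
            i = rng.integers(n)
            # variance-reduced gradient; the control variate is scaled by alpha
            g = grad_i(w, i) - alpha * (grad_i(w_snap, i) - mu)
            w -= lr * g
    return w

w_hat = svrg_with_coefficient()
print("distance to x_true:", np.linalg.norm(w_hat - x_true))
```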
Nikunj Saunshi (@nsaunshi):

Don't miss the poster presentation for this by Nishanth Dikkala at #ICLR2025 tomorrow to learn more about our work on looped Transformers for reasoning!

Poster #272: Hall 3 + 2B. Sat 26th April, 10am - 12:30pm Singapore time
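As background, a looped Transformer reuses one small weight-tied block for several iterations, so effective depth comes from repetition rather than from stacking new layers. The PyTorch sketch below is a minimal illustration of that idea under assumed hyperparameters (vocab size, width, number of loops); it is not the architecture or training setup from the paper.

```python
# Minimal "looped" Transformer sketch: one weight-tied block applied T times.
import torch
import torch.nn as nn

class LoopedTransformer(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, nhead=4,
                 block_layers=2, num_loops=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # one small block whose weights are reused on every loop iteration
        self.block = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers=block_layers,
        )
        self.num_loops = num_loops
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)
        for _ in range(self.num_loops):   # same weights, applied repeatedly
            x = self.block(x)
        return self.head(x)

model = LoopedTransformer()
logits = model(torch.randint(0, 1000, (2, 16)))   # (batch=2, seq=16) token ids
print(logits.shape)                                # -> torch.Size([2, 16, 1000])
```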
Zhiyuan Li (@zhiyuanli_):

Excited to share our new method ✏️PENCIL! It decouples space complexity from time complexity in LLM reasoning by allowing the model to recursively erase and generate thoughts. Joint work with my student Chenxiao Yang, along with Nati Srebro and David McAllester.
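A rough sketch of the erase-and-generate idea: when a chunk of reasoning finishes, the intermediate thought is deleted from the context and only its answer is kept, so the live context (space) can stay small even when the total number of generated tokens (time) is large. The token names and reduction rule below are simplified assumptions for illustration, not the paper's exact specification.

```python
# Sketch of an erase step in a PENCIL-style generation loop: after a
# [RETURN] is emitted, the thought between [CALL] and [SEP] is dropped
# and only the answer is kept, so the live context stays short.
# Token names and the reduction rule are simplified assumptions.
CALL, SEP, RETURN = "[CALL]", "[SEP]", "[RETURN]"

def reduce_context(tokens):
    """Apply the erase rule:  ... [CALL] thought [SEP] answer [RETURN]  ->  ... answer"""
    if tokens and tokens[-1] == RETURN and CALL in tokens and SEP in tokens:
        call = max(i for i, t in enumerate(tokens) if t == CALL)
        sep = max(i for i, t in enumerate(tokens) if t == SEP)
        # keep the prefix before [CALL] plus the answer between [SEP] and [RETURN]
        tokens = tokens[:call] + tokens[sep + 1:-1]
    return tokens

def generate(model_step, prompt, max_steps=100):
    # model_step is a hypothetical next-token callable, used only to show
    # where the reduction would sit inside a generation loop
    ctx = list(prompt)
    for _ in range(max_steps):
        ctx.append(model_step(ctx))      # generate one token
        ctx = reduce_context(ctx)        # erase any finished thought
    return ctx

# toy check of the reduction rule itself
print(reduce_context(["Q", CALL, "t1", "t2", SEP, "A", RETURN]))  # -> ['Q', 'A']
```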