Kwangjun Ahn (@kwangjuna) 's Twitter Profile
Kwangjun Ahn

@kwangjuna

Senior Researcher at Microsoft Research // PhD from MIT EECS

ID: 1229766355622055936

Link: http://kjahn.mit.edu/ | Joined: 18-02-2020 13:56:04

42 Tweets

516 Followers

260 Following

Kwangjun Ahn (@kwangjuna) 's Twitter Profile Photo

Check out Xiang Cheng’s talk on our linear transformer work, given at the Simons Institute!! youtube.com/live/PnwC74s1n…

Ahmad Beirami @ ICLR 2025 (@abeirami) 's Twitter Profile Photo

If you're at #NeurIPS2023, Kwangjun Ahn will be presenting his work on SpecTr++ in the Optimal Transport workshop, where he discusses improved transport plans for speculative decoding.
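
For readers new to the area, here is a minimal sketch of the single-draft speculative sampling acceptance rule that SpecTr-style methods build on. This is only background context I am adding myself: SpecTr++'s improved transport plans over multiple drafts are not reproduced here, and the distributions below are made up.

```python
import numpy as np

def speculative_accept(p, q, x, rng):
    """Single-draft speculative sampling acceptance rule (background only).

    p and q are the target and draft next-token distributions, x is a token
    drawn from q. The returned token is distributed exactly according to p.
    SpecTr-style methods generalize this step to multiple drafts via
    optimal-transport plans, which is not reproduced here.
    """
    if rng.random() < min(1.0, p[x] / q[x]):
        return x                               # accept the drafted token
    residual = np.maximum(p - q, 0.0)          # otherwise resample from the
    residual /= residual.sum()                 # normalized residual of p - q
    return int(rng.choice(len(p), p=residual))

# Toy usage with made-up 4-token distributions.
rng = np.random.default_rng(0)
p = np.array([0.5, 0.2, 0.2, 0.1])      # target model distribution (assumed)
q = np.array([0.25, 0.25, 0.25, 0.25])  # draft model distribution (assumed)
x = int(rng.choice(4, p=q))
print(speculative_accept(p, q, x, rng))
```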

Aaron Defazio (@aaron_defazio) 's Twitter Profile Photo

Exciting new paper by Kwangjun Ahn and Ashok Cutkosky! "Adam with model exponential moving average is effective for nonconvex optimization": arxiv.org/pdf/2405.18199. This approach to analyzing Adam is extremely promising IMHO.
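
As background for the idea in the title, here is a minimal NumPy sketch, under my own assumptions, of running a plain Adam-style update while keeping an exponential moving average of the parameters as the iterate you would actually evaluate. The hyperparameters and the toy gradient oracle are illustrative, not values from the paper.

```python
import numpy as np

def adam_with_model_ema(loss_grad, w0, steps=2000, lr=1e-3,
                        beta1=0.9, beta2=0.999, eps=1e-8, ema_decay=0.999):
    """Plain Adam plus an exponential moving average of the iterates.

    loss_grad(w) is assumed to return a stochastic gradient at w; the EMA
    iterate w_ema is the one you would evaluate. All hyperparameters here
    are illustrative defaults, not values from the paper.
    """
    w = np.array(w0, dtype=float)
    w_ema = w.copy()
    m = np.zeros_like(w)                      # first-moment estimate
    v = np.zeros_like(w)                      # second-moment estimate
    for t in range(1, steps + 1):
        grad = loss_grad(w)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad * grad
        m_hat = m / (1 - beta1 ** t)          # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
        w_ema = ema_decay * w_ema + (1 - ema_decay) * w   # model EMA
    return w_ema

# Toy usage: a noisy quadratic whose minimizer is the origin.
rng = np.random.default_rng(0)
noisy_grad = lambda w: 2 * w + 0.1 * rng.standard_normal(w.shape)
print(adam_with_model_ema(noisy_grad, np.ones(3)))
```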

Kwangjun Ahn (@kwangjuna) 's Twitter Profile Photo

I successfully defended my thesis at MIT EECS yesterday! A huge thank you to my advisors, Suvrit and Ali, and my committee member Ashia! It covers my recent work on Transformers and Adam. For those who are interested, check out the video: youtu.be/5rgrB7TGPdc

Kwangjun Ahn (@kwangjuna) 's Twitter Profile Photo

In our ICML 2024 paper, joint w/ Zhiyu Zhang, Yunbum Kook, and Yan Dai, we provide a new perspective on the Adam optimizer based on online learning. In particular, our perspective shows the importance of Adam's key components. (video: youtu.be/AU39SNkkIsA)
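
Without claiming to reproduce the paper's construction, one way to see the online-learning reading of Adam is that its update direction m_t / sqrt(v_t) is, up to a constant, a ratio of discounted gradient sums, i.e., the kind of statistic a discounted FTRL-style online learner over updates would produce. A small numeric check of that identity (my own illustration, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
beta1, beta2, T = 0.9, 0.99, 50
g = rng.standard_normal(T)                  # a stream of scalar gradients

# Textbook Adam moment estimates (single coordinate, no bias correction).
m = v = 0.0
for t in range(T):
    m = beta1 * m + (1 - beta1) * g[t]
    v = beta2 * v + (1 - beta2) * g[t] ** 2
adam_dir = m / np.sqrt(v)

# The same direction written directly as discounted sums of the stream.
disc_g = sum(beta1 ** (T - 1 - s) * g[s] for s in range(T))
disc_g2 = sum(beta2 ** (T - 1 - s) * g[s] ** 2 for s in range(T))
ftrl_dir = disc_g / np.sqrt(disc_g2)

# The two agree up to the constant (1 - beta1) / sqrt(1 - beta2).
print(adam_dir, ftrl_dir * (1 - beta1) / np.sqrt(1 - beta2))
```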

Kwangjun Ahn (@kwangjuna) 's Twitter Profile Photo

Come to my presentation of our ICML 2024 paper tmrw at 1:30–3 pm! We provide a new perspective on the Adam optimizer based on online learning. In particular, our perspective shows the importance of Adam's key components. (video: youtu.be/AU39SNkkIsA)

John Langford (@johnclangford) 's Twitter Profile Photo

New reqs for low to high level researcher positions: jobs.careers.microsoft.com/global/en/job/… , jobs.careers.microsoft.com/global/en/job/…, jobs.careers.microsoft.com/global/en/job/…, jobs.careers.microsoft.com/global/en/job/…, with postdocs from Akshay and Miro Dudik x.com/MiroDudik/stat… . Please apply or pass to those who may :-)

John Langford (@johnclangford) 's Twitter Profile Photo

The Belief State Transformer edwardshu.com/bst-website/ is at ICLR this week. The BST objective efficiently creates compact belief states: summaries of the past sufficient for all future predictions. See the short talk: microsoft.com/en-us/research… and mgostIH for further discussion.
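
As a rough paraphrase of the objective described above (my toy sketch, not the authors' code: GRUs stand in for the paper's transformer encoders, and all shapes and hyperparameters are invented for illustration), a forward encoder summarizes the prefix, a backward encoder summarizes the suffix, and a head predicts the held-out tokens on either side of the gap.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTwoEncoderObjective(nn.Module):
    """Toy paraphrase of a forward/backward two-encoder objective.

    A forward encoder summarizes the prefix, a backward encoder summarizes
    the suffix (read right to left), and a head predicts the two held-out
    tokens between them. GRUs stand in for the transformer encoders of the
    actual paper; everything here is an illustrative assumption.
    """
    def __init__(self, vocab, dim=64):
        super().__init__()
        self.vocab = vocab
        self.emb = nn.Embedding(vocab, dim)
        self.fwd = nn.GRU(dim, dim, batch_first=True)
        self.bwd = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(2 * dim, 2 * vocab)   # next/prev token heads

    def forward(self, tokens, split):
        # tokens: (B, T) integer ids; hold out positions split and split+1.
        prefix = tokens[:, :split]
        suffix = tokens[:, split + 2:]
        _, h_fwd = self.fwd(self.emb(prefix))                # (1, B, dim)
        _, h_bwd = self.bwd(self.emb(suffix.flip(dims=[1]))) # suffix reversed
        state = torch.cat([h_fwd[0], h_bwd[0]], dim=-1)      # belief-state pair
        logits = self.head(state).view(-1, 2, self.vocab)    # (B, 2, vocab)
        targets = torch.stack([tokens[:, split],      # token after the prefix
                               tokens[:, split + 1]], # token before the suffix
                              dim=1)
        return F.cross_entropy(logits.reshape(-1, self.vocab),
                               targets.reshape(-1))

# Toy usage on random token sequences.
model = TinyTwoEncoderObjective(vocab=100)
tokens = torch.randint(0, 100, (8, 16))
loss = model(tokens, split=7)
loss.backward()
print(float(loss))
```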

Kwangjun Ahn (@kwangjuna) 's Twitter Profile Photo

ICLR: Edward Hu and I will be presenting our work "The Belief State Transformer" at the 1st poster session. (#269) Please come check it out! (github: github.com/microsoft/BST)

Konstantin Mishchenko (@konstmish) 's Twitter Profile Photo

Schedule-Free methods, which forgo cosine/linear schedulers by averaging iterates and computing gradients at interpolated points, yield smoother training curves. It's still unclear why they work well, and this paper explains the phenomenon through the river-valley loss landscape.

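For context on "averaging iterates and computing gradients at interpolated points", here is a minimal NumPy sketch of a Schedule-Free SGD step as I understand the method; the hyperparameters and toy problem are illustrative, and the authors' reference implementation should be preferred.

```python
import numpy as np

def schedule_free_sgd(grad, w0, steps=2000, lr=0.1, beta=0.9):
    """Sketch of Schedule-Free SGD as I read the method description.

    z is the base SGD iterate, x is a running (uniform) average of z, and
    the gradient is queried at the interpolation y between them, so no
    decaying learning-rate schedule is used. Hyperparameters are illustrative.
    """
    z = np.array(w0, dtype=float)
    x = z.copy()
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x        # interpolated gradient query point
        z = z - lr * grad(y)                 # SGD step on the base iterate
        x = (1 - 1 / t) * x + (1 / t) * z    # online average of the z iterates
    return x                                 # the averaged iterate is evaluated

# Toy usage: a noisy quadratic whose minimizer is the origin.
rng = np.random.default_rng(0)
noisy_grad = lambda w: 2 * w + 0.05 * rng.standard_normal(w.shape)
print(schedule_free_sgd(noisy_grad, np.ones(3)))
```
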
Gagik Magakyan (@gagmagakyan) 's Twitter Profile Photo

If you are at ICML 2025, come check out our oral presentation about the non-convex theory of Schedule Free SGD in the Optimization session tomorrow! This work was done with amazing collaborators Kwangjun Ahn and Ashok Cutkosky.

Mikhail Parakhin (@mparakhin) 's Twitter Profile Photo

Since nobody asked :-), here is my list of papers not to be missed from ICML: 1) Dion: distributed orthonormalized updates (well, technically not at ICML, but everyone's talking about it). 2) MARS: Unleashing the Power of Variance Reduction for Training Large Models 3) ...