Jeremy Cohen (@deepcohen)'s Twitter Profile
Jeremy Cohen

@deepcohen

Research fellow at Flatiron Institute, working on understanding optimization in deep learning. Previously: PhD in machine learning at Carnegie Mellon.

ID: 369877186

Link: http://cs.cmu.edu/~jeremiac
Joined: 08-09-2011 02:53:23

1.1K Tweets

4.4K Followers

908 Following

So Yeon (Tiffany) Min on Industry Job Market (@soyeontiffmin) 's Twitter Profile Photo

I am on the industry job market, and am planning to interview around next March. I am attending NeurIPS Conference, and I hope to meet you there if you are hiring! My website: soyeonm.github.io Short bio about me: I am a 5th year PhD student at CMU MLD, working with Russ Salakhutdinov

Alberto Bietti (@albertobietti) 's Twitter Profile Photo

Applications to our Research Fellow position at Flatiron CCM are closing soon on Dec 15! It's a great place for doing fundamental ML research with a lot of freedom in a great environment, in the heart of NYC. Apply here: apply.interfolio.com/155357

Berfin Simsek (@bsimsek13) 's Twitter Profile Photo

📢 I'm on the faculty job market this year! My research explores the foundations of deep learning and analyzes learning and feature geometry for Gaussian inputs. I detail my major contributions below 👇 Retweet if you find it interesting and help me spread the word! My DMs are open. 1/n

Jeremy Cohen (@deepcohen) 's Twitter Profile Photo

I’ll be at NeurIPS from Wednesday through Sunday. Would be great to meet with anyone interested in optimization dynamics of deep learning! DMs are open.

Dayal Kalra (@dayal_kalra) 's Twitter Profile Photo

I'll be at #NeurIPS2024 this week, presenting our work tomorrow on the mechanisms of warmup! openreview.net/forum?id=NVl4S… 📍 West Ballroom A-D (#5907) 📅 Wed, Dec 11 ⏰ 4:30 PM - 7:30 PM PST Looking forward to engaging discussions!
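
For context (not from the paper): learning-rate warmup simply ramps the stepsize from a small value up to its target over the first steps of training. Below is a minimal sketch of a linear warmup schedule in Python; the step count and the 1e-3 target value are illustrative, not taken from the work above.

```python
def lr_at_step(step, warmup_steps=1000, target_lr=1e-3):
    """Linear learning-rate warmup: ramp from ~0 to target_lr, then hold."""
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr
```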

Bobby (@bobby_he) 's Twitter Profile Photo

Come by poster #2402 East hall at NeurIPS from 11am-2pm Friday to chat about why outlier features emerge during training and how we can prevent them!

Ameet Talwalkar (@atalwalkar) 's Twitter Profile Photo

I have some news to share! Datadog, Inc. is forming a new AI research lab, and I'm excited to announce that I've joined as Chief Scientist to lead this effort. Datadog has a great work culture, lots of data and compute, and is committed to open science and open sourcing. Our team

Pierfrancesco Beneventano (@pierbeneventano) 's Twitter Profile Photo

Arseniy and I have, I believe, taken a step towards properly characterizing how and when mini-batch SGD training exhibits the Edge of Stability / Break-Even Point (Stanisław Jastrzębski, Jeremy Cohen). Link: arxiv.org/abs/2412.20553

Samuel Sokota (@ssokota) 's Twitter Profile Photo

Model-free deep RL algorithms like NFSP, PSRO, ESCHER, & R-NaD are tailor-made for games with hidden information (e.g. poker). We performed the largest-ever comparison of these algorithms. We find that they do not outperform generic policy gradient methods, such as PPO. 1/N

Jacob Springer (@jacspringer) 's Twitter Profile Photo

Training with more data = better LLMs, right? 🚨 False! Scaling language models by adding more pre-training data can decrease your performance after post-training! Introducing "catastrophic overtraining." 🥁🧵+arXiv 👇 1/9

Christina Baek (@_christinabaek) 's Twitter Profile Photo

Are current reasoning models optimal for test-time scaling? 🌠 No! Models make the same incorrect guess over and over again. We show that you can fix this problem w/o any crazy tricks 💫 – just do weight ensembling (WiSE-FT) for big gains on math! 1/N
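
For context, WiSE-FT-style weight ensembling linearly interpolates the weights of two checkpoints of the same architecture. Here is a minimal PyTorch sketch, assuming the two models share identical state dicts; the function name and the alpha=0.5 default are illustrative, not from the paper.

```python
import copy
import torch

def wise_ft_merge(base_model, finetuned_model, alpha=0.5):
    """Weight ensembling in the spirit of WiSE-FT: return a copy of base_model
    whose parameters are (1 - alpha) * theta_base + alpha * theta_finetuned.
    Both models must have identical architectures. Sketch only."""
    merged = copy.deepcopy(base_model)
    base_sd = base_model.state_dict()
    ft_sd = finetuned_model.state_dict()
    merged_sd = {}
    for name, tensor in base_sd.items():
        if torch.is_floating_point(tensor):
            merged_sd[name] = (1 - alpha) * tensor + alpha * ft_sd[name]
        else:
            merged_sd[name] = tensor  # e.g. integer buffers: keep as-is
    merged.load_state_dict(merged_sd)
    return merged
```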

Asher Trockman (@ashertrockman) 's Twitter Profile Photo

Are you a frontier lab investing untold sums in training? Are you trying to stay competitive? Are you finding that your competitors' models are ... thinking a bit too much like yours? Then antidistillation.com might be for you! Sam Altman Elon Musk

Dayal Kalra (@dayal_kalra) 's Twitter Profile Photo

Excited to share that our paper "Universal Sharpness Dynamics..." has been accepted to #ICLR2025! Neural net training exhibits rich curvature (sharpness) dynamics (sharpness reduction, progressive sharpening, Edge of Stability) - but why? 🤔 We show that a minimal model captures it all! 1/n
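
For readers new to the term: "sharpness" here means the largest eigenvalue of the training-loss Hessian. Below is a minimal sketch of how it is typically estimated, via power iteration on Hessian-vector products; it assumes a single flat parameter tensor with requires_grad=True and is not code from the paper.

```python
import torch

def estimate_sharpness(loss_fn, params, num_iters=20):
    """Estimate the top Hessian eigenvalue (sharpness) by power iteration
    on Hessian-vector products. `params` is a flat tensor with
    requires_grad=True and `loss_fn(params)` returns a scalar loss."""
    loss = loss_fn(params)
    (grad,) = torch.autograd.grad(loss, params, create_graph=True)
    v = torch.randn_like(params)
    v /= v.norm()
    eigval = torch.tensor(0.0)
    for _ in range(num_iters):
        (hv,) = torch.autograd.grad(grad @ v, params, retain_graph=True)
        eigval = v @ hv                 # Rayleigh quotient with the current v
        v = hv / (hv.norm() + 1e-12)
    return eigval.item()
```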

Vaishnavh Nagarajan (@_vaishnavh) 's Twitter Profile Photo

📢 New paper on creativity & multi-token prediction! We design minimal open-ended tasks to argue: → LLMs are limited in creativity since they learn to predict the next token → creativity can be improved via multi-token learning & injecting noise ("seed-conditioning" 🌱) 1/ 🧵

Robert M. Gower 🇺🇦 (@gowerrobert) 's Twitter Profile Photo

Are you interested in the new Muon/Scion/Gluon method for training LLMs? To run Muon, you need to approximate the matrix sign (or polar factor) of the momentum matrix. We've developed an optimal method *The PolarExpress* just for this! If you're interested, climb aboard 1/x
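
Background, for readers outside this niche: the polar factor of a matrix with SVD UΣVᵀ is UVᵀ (the nearest orthogonal matrix), and Muon-style updates replace the raw momentum matrix with an approximation of it. Here is a minimal sketch using the textbook cubic Newton-Schulz iteration; the PolarExpress itself derives optimized coefficients, which are not reproduced here.

```python
import torch

def approx_polar_factor(m, num_iters=8, eps=1e-7):
    """Approximate the polar factor U V^T of a matrix M = U S V^T using the
    plain Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X. Normalizing by
    the Frobenius norm first keeps the spectral norm <= 1, so the singular
    values converge towards 1. Sketch only: not the PolarExpress coefficients."""
    x = m / (m.norm() + eps)
    for _ in range(num_iters):
        x = 1.5 * x - 0.5 * x @ x.transpose(-2, -1) @ x
    return x
```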

Jingfeng Wu (@uuujingfeng) 's Twitter Profile Photo

1/3 Sharing two new papers on accelerating GD via large stepsizes! Classical GD analysis assumes small stepsizes for stability. However, in practice, GD is often used with large stepsizes, which lead to instability. See my slides for more details: uuujf.github.io/postdoc/wu2025…
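
To make the stability threshold concrete, here is a tiny worked example (mine, not from the slides): for gradient descent on a quadratic with curvature L, the classical analysis requires stepsize eta < 2/L; above that threshold the iterates oscillate with growing magnitude.

```python
# Gradient descent on f(x) = 0.5 * L * x^2 with L = 1, so the classical
# stability threshold is eta < 2/L = 2. Each step is x <- (1 - eta*L) * x.
def gd_trajectory(eta, x0=1.0, steps=8, L=1.0):
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - eta * L * xs[-1])
    return xs

print(gd_trajectory(eta=0.5))  # stable: |x| shrinks monotonically
print(gd_trajectory(eta=2.5))  # unstable: sign flips, |x| grows by 1.5x per step
```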
