Nicholas Lourie (@nicklourie)'s Twitter Profile
Nicholas Lourie

@nicklourie

Better empirical methods for deep learning. PhD at @nyuniversity (@CILVRatNYU). Advised by @kchonyc and @hhexiy. Prev: @allen_ai.

I build things. 🤖

ID: 2370922034

Link: https://github.com/nicholaslourie/opda | Joined: 03-03-2014 20:43:21

33 Tweets

1.1K Followers

1.1K Following

NYU Center for Data Science (@nyudatascience)

CDS Prof. Kyunghyun Cho has published two new papers, urging a reevaluation of how progress in AI is measured. Are we advancing or just repeating history? Learn more: nyudatascience.medium.com/separating-hyp…

Jane Pan (@janepan_)

Do LLMs exploit imperfect proxies of human preference in context? Yes!

In fact, they do it so severely that iterative refinement can make outputs worse when judged by actual humans. In other words, reward hacking can occur even without gradient updates!

w/ He He,
Siavash Golkar (@siavashgolkar)

SOTA models often use bidirectional transformers for non-NLP tasks but did you know causal transformers can outperform them even on tasks without a causal structure?

Our recent work shows causal transformers learn circuits bidirectional ones can't, leading to better performance!
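
For readers less familiar with the distinction: architecturally, the difference between causal and bidirectional transformers is just the attention mask. Below is a minimal, generic sketch of scaled dot-product attention with and without a causal mask; it is my own illustration with made-up arrays, not code from the paper.

```python
import numpy as np

def attention(q, k, v, causal=False):
    """Scaled dot-product attention; the causal flag hides future positions."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (T, T) attention logits
    if causal:
        t = scores.shape[0]
        future = np.triu(np.ones((t, t), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)     # block attention to later tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                            # 5 positions, 8-dim embeddings
bidirectional_out = attention(x, x, x, causal=False)   # every token attends to every token
causal_out = attention(x, x, x, causal=True)           # token t attends only to tokens <= t
```
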
Mayee Chen (@mayeechen)

There are many algorithms for constructing pre-training data mixtures—which one should we use? Turns out: many of them fall under one framework, have similar issues, and can be improved with a straightforward modification.

Introducing Aioli! 🧄 1/9
Michael Hu (@michahu8)

So you want a good pretraining data mix🧑‍🍳, but which data mixing algorithm do you pick? DoGE, DoReMi, Skill-it, grid searching proportions… 😵‍💫

It turns out that these algorithms are all special cases of Linear Mixing Optimization (LMO), our new data mixing framework! 🧵
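
To make the "linear mixing" idea concrete, here is a toy sketch, my own illustration rather than the LMO algorithm from the paper: fit each domain's validation loss as a linear function of the mixture proportions using a few pilot runs, then pick the mixture on the simplex with the lowest predicted average loss. All numbers and helper names below are made up.

```python
import numpy as np

# Toy setup: a few small pilot runs, each with different mixture proportions
# over 3 data domains, and the per-domain validation losses they produced.
mixtures = np.array([        # rows: pilot runs, columns: proportion of each domain
    [0.6, 0.2, 0.2],
    [0.2, 0.6, 0.2],
    [0.2, 0.2, 0.6],
    [0.34, 0.33, 0.33],
])
losses = np.array([          # rows: pilot runs, columns: per-domain validation loss
    [2.1, 3.0, 2.9],
    [2.6, 2.4, 2.8],
    [2.7, 2.9, 2.3],
    [2.4, 2.6, 2.5],
])

# Fit a linear mixing law per domain: loss_k(p) ~= A[k] @ p.
coef, *_ = np.linalg.lstsq(mixtures, losses, rcond=None)   # coef[:, k] fits domain k
A = coef.T                                                  # row k: coefficients for domain k

def predicted_avg_loss(p):
    """Average predicted per-domain loss under mixture proportions p."""
    return float(np.mean(A @ p))

# Choose the mixture on a coarse simplex grid that minimizes predicted loss.
grid = [np.array([i, j, 10 - i - j]) / 10.0
        for i in range(11) for j in range(11 - i)]
best = min(grid, key=predicted_avg_loss)
print("best mixture (sketch):", best, "predicted avg loss:", predicted_avg_loss(best))
```
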
Nicholas Lourie (@nicklourie)

If scaling no longer makes economic sense, what does that mean for research?? Will we see more work on architecture and fundamentals again? Or, will the current spread of topics remain unchanged? 🤔

alphaXiv (@askalphaxiv)

Finding good data mixtures for LLM training can be tricky - Aioli provides a unified framework to construct pre-training data mixtures. Talk to the authors Mayee Chen Michael Hu Nicholas Lourie Kyunghyun Cho Chris Re hazyresearch directly here!

Anthropic (@anthropicai)

New Anthropic research: Adding Error Bars to Evals. AI model evaluations don’t usually include statistics or uncertainty. We think they should. Read the blog post here: anthropic.com/research/stati…
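
In the spirit of that post, here is a minimal sketch, not Anthropic's code, of attaching a standard error and a normal-approximation 95% confidence interval to an eval accuracy, treating per-question scores as independent. The scores array below is synthetic.

```python
import numpy as np

# Hypothetical per-question scores from an eval (1 = correct, 0 = incorrect).
scores = np.random.default_rng(0).integers(0, 2, size=500).astype(float)

n = len(scores)
acc = scores.mean()                                    # point estimate of accuracy
sem = scores.std(ddof=1) / np.sqrt(n)                  # standard error of the mean
ci_low, ci_high = acc - 1.96 * sem, acc + 1.96 * sem   # normal-approx 95% CI

print(f"accuracy = {acc:.3f} ± {1.96 * sem:.3f}  (95% CI: [{ci_low:.3f}, {ci_high:.3f}])")
```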

Nicholas Lourie (@nicklourie)

I missed this great paper when it came out last year! TL;DR: Prediction errors from models trained with different random seeds become independent as training converges, at least for the image classification tasks they consider. This finding has important implications if you're
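
To unpack what "errors become independent" means in practice, here is a toy sketch of my own, not from the paper: build the 2x2 contingency table of right/wrong outcomes for two seeds on a shared test set and run a chi-squared test of independence. The error arrays below are synthetic.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Synthetic data: 1 = the model got the example wrong, 0 = it got it right.
rng = np.random.default_rng(0)
errors_seed_a = rng.integers(0, 2, size=2000)
errors_seed_b = rng.integers(0, 2, size=2000)

# 2x2 contingency table of (A wrong?, B wrong?) counts.
table = np.array([
    [np.sum((errors_seed_a == 0) & (errors_seed_b == 0)),
     np.sum((errors_seed_a == 0) & (errors_seed_b == 1))],
    [np.sum((errors_seed_a == 1) & (errors_seed_b == 0)),
     np.sum((errors_seed_a == 1) & (errors_seed_b == 1))],
])

chi2, p_value, _, _ = chi2_contingency(table)
print(f"chi-squared = {chi2:.2f}, p = {p_value:.3f}")
# Under independence (the paper's finding near convergence), p should be large;
# strongly correlated errors would give a small p.
```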

Nicholas Lourie (@nicklourie)

Anthropic put out a great primer on statistical methods for LLM evals by Evan Miller. Check out his blog too! He's written gems on A/B testing and other topics. Just make sure you don't mind losing an afternoon like I did when I first came across it! 😆 evanmiller.org

Charlie Snell (@sea_snell)

Can we predict emergent capabilities in GPT-N+1🌌 using only GPT-N model checkpoints, which have random performance on the task?

We propose a method for doing exactly this in our paper “Predicting Emergent Capabilities by Finetuning”🧵
Nicholas Lourie (@nicklourie)

A great idea by Charlie Snell: Use finetuning to predict where zero-shot capabilities emerge. This lets you experiment at a smaller scale. The more finetuning data you have, the smaller a model you can use. Here's how I think about it: a one-time cost collecting data saves you

Michael Hu (@michahu8)

📢 today's scaling laws often don't work for predicting downstream task performance. For some pretraining setups, smooth and predictable scaling is the exception, not the rule.

a quick read about scaling law fails: 
📜arxiv.org/abs/2507.00885

🧵1/5👇
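
For context on what such a prediction looks like when it does work, here is a minimal sketch with made-up numbers, not from the paper: fit a saturating power law to pretraining loss across model sizes and extrapolate it. The thread's point is that downstream task metrics often refuse to follow such a smooth curve, so this kind of extrapolation can fail for them.

```python
import numpy as np
from scipy.optimize import curve_fit

# Made-up (model size, pretraining loss) points from a small model sweep.
sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])    # parameter counts
losses = np.array([3.9, 3.5, 3.1, 2.8, 2.6])   # final pretraining loss

def power_law(n, a, b, c):
    """Saturating power law: loss(N) = a * N^(-b) + c."""
    return a * n ** (-b) + c

params, _ = curve_fit(power_law, sizes, losses, p0=(10.0, 0.1, 2.0), maxfev=10000)
a, b, c = params
print(f"fit: loss(N) = {a:.2f} * N^(-{b:.3f}) + {c:.2f}")
print("extrapolated loss at 1e10 params:", power_law(1e10, *params))

# Pretraining loss often follows a curve like this; the failure mode in the
# thread is that a downstream metric (e.g. benchmark accuracy) frequently
# does not, so extrapolating a smooth fit to predict it can go badly wrong.
```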