Jason Lee (@jasondeanlee) 's Twitter Profile
Jason Lee

@jasondeanlee

Associate Professor at Princeton. Former Research Scientist at Google DeepMind. ML/AI Researcher working on foundations of LLMs and deep learning

ID: 1055883888159932416

Link: http://jasondlee88.github.io · Joined: 26-10-2018 18:08:31

1.1K Tweets

13.1K Followers

3.3K Following

Aryeh Kontorovich (@aryehazan) 's Twitter Profile Photo

On the Statistical Query Complexity of Learning Semiautomata: a Random Walk Approach. George Giapitzakis, Kimon Fountoulakis, Eshaan Nichani, Jason D. Lee. arxiv.org/abs/2510.04115…

Kimon Fountoulakis (@kfountou) 's Twitter Profile Photo

I am hiring one PhD student.

Subject: Reasoning and AI, with a focus on computational learning for long reasoning processes such as automated theorem proving and the learnability of algorithmic tasks.

Preferred background: A mathematics student interested in transitioning to
david (@davidtsong) 's Twitter Profile Photo

apparently the smart way to recruit SW engineers is to value high school brand > college

average Lynbrook alumni engineer 10x > average Stanford engineer

Sham Kakade (@shamkakade6) 's Twitter Profile Photo

1/8 Second Order Optimizers like SOAP and Muon have shown impressive performance on LLM optimization. But are we fully utilizing the potential of second order information? New work: we show that a full second order optimizer is much better than existing optimizers in terms of

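(The thread doesn't spell out the optimizer itself, so nothing below comes from it. Purely as a reminder of what a "full second order" step means, here is a minimal sketch contrasting a gradient step with a Newton step on a toy ill-conditioned quadratic; the matrix A, step size, and iteration count are all illustrative assumptions.)

```python
import numpy as np

# Toy quadratic f(w) = 0.5 * w^T A w - b^T w with ill-conditioned curvature.
# Generic illustration of first- vs. full second-order updates,
# not the optimizer studied in the thread.
A = np.array([[10.0, 0.0],
              [0.0,  1.0]])
b = np.array([1.0, 1.0])

def grad(w):
    return A @ w - b

def hess(w):
    return A  # constant Hessian for a quadratic

w_gd, w_newton = np.zeros(2), np.zeros(2)
for _ in range(10):
    w_gd = w_gd - 0.09 * grad(w_gd)                                        # first-order step
    w_newton = w_newton - np.linalg.solve(hess(w_newton), grad(w_newton))  # full second-order (Newton) step

print(np.linalg.norm(grad(w_gd)))      # ~0.39: GD still crawling along the flat direction
print(np.linalg.norm(grad(w_newton)))  # ~0.0: Newton solves the quadratic in one step
```
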
Konstantin Mishchenko (@konstmish) 's Twitter Profile Photo

Weight decay changes the training objective because the decay update can conflict with the gradient update, so the equilibrium is no longer where the gradient is zero. This paper proposes a single-line edit that applies weight decay in a way that preserves the stationary points.

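(The single-line fix itself isn't described in the tweet, so it isn't reproduced here. As a sketch of the first claim only, the toy example below, a 1-D quadratic with an AdamW-style decoupled decay step, both assumptions of mine, shows that the fixed point of "gradient step + decay step" is no longer a point where the gradient vanishes.)

```python
# Toy illustration of the claim, not the paper's method: decoupled weight decay
# on L(w) = 0.5 * (w - 2)^2, whose unregularized minimum is w = 2.
def grad(w):
    return w - 2.0

def train(lr=0.1, lam=0.0, steps=2000):
    w = 0.0
    for _ in range(steps):
        w -= lr * grad(w)   # gradient step
        w -= lr * lam * w   # decoupled (AdamW-style) weight-decay step
    return w

print(train(lam=0.0))  # ~2.00: equilibrium sits where grad(w) = 0
print(train(lam=0.1))  # ~1.82: equilibrium shifted; there grad(w) ≈ -lam * w, not 0
```
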
UC Berkeley EECS (@berkeley_eecs) 's Twitter Profile Photo

UC Berkeley EECS is hiring! We're seeking exceptional faculty candidates at all ranks for our "Engineering + AI" search and up to 7 tenure-track Asst. Professors in EECS.

EECS Focused Searches Include:
Quantum Computing ⚛️
AI, Inequality, & Society ⚖️

bit.ly/40L2bwA

Damek (@damekdavis) 's Twitter Profile Photo

In this note w/ Ben Recht we look at RL problems with 0/1 rewards, showing that popular methods maximize the average (transformed) probability of correctly answering a prompt x:

max_θ 𝔼ₓ h(Prob(correct ∣ x; θ))

for certain functions h. Weirdly, h is arcsin(√t) in GRPO.
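
(To make the displayed objective concrete, here is a small Monte Carlo sketch of E_x h(Prob(correct | x; θ)) with h(t) = arcsin(√t) versus h(t) = t. The two-prompt toy world and the pass_prob function are hypothetical stand-ins, not anything from the note.)

```python
import math
import random

def h_identity(t):   # plain expected accuracy
    return t

def h_grpo(t):       # the transform the note attributes to GRPO
    return math.asin(math.sqrt(t))

def objective(pass_prob, h, prompts, samples=100_000):
    """Monte Carlo estimate of E_x[ h(Prob(correct | x)) ] over sampled prompts."""
    return sum(h(pass_prob(random.choice(prompts))) for _ in range(samples)) / samples

# Hypothetical two-prompt world: one easy prompt (p = 0.9), one hard prompt (p = 0.01).
prompts = ["easy", "hard"]

def pass_prob(x):
    return 0.9 if x == "easy" else 0.01

print(objective(pass_prob, h_identity, prompts))  # ~0.455: average pass probability
print(objective(pass_prob, h_grpo, prompts))      # ~0.675: same prompts under h(t) = arcsin(sqrt(t))
```

One consequence of that choice of h: since h'(t) = 1/(2√(t(1−t))) blows up near 0 and 1, the transformed objective weights changes in pass probability near those extremes more heavily than plain accuracy would.
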
Sham Kakade (@shamkakade6) 's Twitter Profile Photo

1/6 Introducing Seesaw: a principled batch size scheduling algo. Seesaw achieves theoretically optimal serial run time given a fixed compute budget and also matches the performance of cosine annealing at fixed batch size.

Damek (@damekdavis) 's Twitter Profile Photo

a fun exercise for the autoformalization companies:

> formalize a "gradient descent on neural networks learns xyz" style paper

they are often ~100 pages of algebra, concentration inequalities, and optimization. Beyond a few grad students, I'm not sure anyone has verified one.

Damek (@damekdavis) 's Twitter Profile Photo

Rota I would actually be curious for someone to formalize the tensor programs papers by Greg Yang. I couldn't quite get to a crisp statement of the results in those works. It would be nice to know what they are saying beyond the "folklore" explanation of feature learning people parrot.

Zhuoran Yang (@zhuoran_yang) 's Twitter Profile Photo

Imagine a research paradigm where nascent ideas evolve into fully realized papers, complete with empirical data, insightful figures, and robust citations, through an iterative, feedback-driven autonomous system. This vision guides our work.

We introduce **freephdlabor**: a
Ravid Shwartz Ziv (@ziv_ravid) 's Twitter Profile Photo

Repeat after me: LLMs are not humans.

Is RL like how humans learn? No!
Is SFT like how humans learn? No!
Is next-token prediction like humans? No!
Will the next big thing in AI be like humans? No!
Does it matter? No!

Thank you for your attention to this matter!