Xinyang (Young) Geng (@younggeng)'s Twitter Profile
Xinyang (Young) Geng

@younggeng

Research scientist at Google DeepMind. Opinions are my own.

ID: 2362406610

Website: http://young-geng.xyz/ · Joined: 26-02-2014 09:17:53

66 Tweets

1.1K Followers

513 Following

Lechao Xiao (@locchiu) 's Twitter Profile Photo

1/5. Excited to share a spicy paper, "Rethinking conventional wisdom in machine learning: from generalization to scaling", arxiv.org/pdf/2409.15156.  
You might love it or dislike it!  
NotebookLM: notebooklm.google.com/notebook/43f11…
While double-descent (generalization-centric,
Charlie Snell (@sea_snell) 's Twitter Profile Photo

Good post-training data is precious and scarce; compute is less so. We should focus on methods which squeeze more out of existing data by spending additional compute per datapoint, rather than optimizing for cheaper post-training methods

Cristian Garcia (@cgarciae88) 's Twitter Profile Photo

People learning JAX, feel free to reach out if the learning curve feels too steep; hopefully we can flatten it out. Also, check out the JAX LLM Discord for help from the community: discord.gg/m9NDrmENe2

Charlie Snell (@sea_snell) 's Twitter Profile Photo

Can we predict emergent capabilities in GPT-N+1🌌 using only GPT-N model checkpoints, which have random performance on the task?

We propose a method for doing exactly this in our paper “Predicting Emergent Capabilities by Finetuning”🧵
Jerry Tworek (@millionint) 's Twitter Profile Photo

People completely misunderstand the data wall. It's the data slop wall. Most of the data is so bad it's a waste of a good gpu to backprop it.

Jack Rae (@jack_w_rae) 's Twitter Profile Photo

Appreciate Aidan McLaughlin looking into the thinking model results. Originally the scores looked weak because the response was being pulled from the thought content rather than the final output. We are looking into ways of making thinking output less confusing for people running evals. This is why we 🚢, to

Andrej Karpathy (@karpathy) 's Twitter Profile Photo

It’s done because it’s much easier to 1) collect, 2) evaluate, and 3) beat and make progress on. We’re going to see every task that is served neatly packaged on a platter like this improved (including those that need PhD-grade expertise). But jobs (even intern-level) that need

Jim Fan (@drjimfan) 's Twitter Profile Photo

Whether you like it or not, the future of AI will not be canned genies controlled by a "safety panel". The future of AI is democratization. Every internet rando will run not just o1, but o8, o9 on their toaster laptop. It's the tide of history that we should surf on, not swim

Andrej Karpathy (@karpathy) 's Twitter Profile Photo

For friends of open source: imo the highest leverage thing you can do is help construct a high diversity of RL environments that help elicit LLM cognitive strategies. To build a gym of sorts. This is a highly parallelizable task, which favors a large community of collaborators.
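As a rough illustration of what such an environment could look like (a minimal sketch, not from the tweet; the task, names, and reward scheme are all hypothetical), here is a tiny text-based environment following the familiar reset/step convention:

```python
# Hypothetical sketch of a text-based RL environment for eliciting LLM
# reasoning strategies. Task, names, and reward scheme are illustrative only.
from dataclasses import dataclass, field


@dataclass
class ArithmeticEnv:
    """One-step environment: the agent (an LLM) answers an arithmetic prompt."""
    a: int = 17
    b: int = 25
    done: bool = field(default=False, init=False)

    def reset(self) -> str:
        """Return the initial observation (the task prompt)."""
        self.done = False
        return f"Compute {self.a} + {self.b}. Answer with a single integer."

    def step(self, action: str):
        """Score the model's text action and return (obs, reward, done, info)."""
        try:
            reward = 1.0 if int(action.strip()) == self.a + self.b else 0.0
        except ValueError:
            reward = 0.0
        self.done = True
        return "", reward, self.done, {}


# Usage: roll out a "policy" (a hard-coded string standing in for an LLM call).
env = ArithmeticEnv()
prompt = env.reset()
_, reward, done, _ = env.step("42")
print(prompt, reward, done)
```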

Hieu Pham (@hyhieu226) 's Twitter Profile Photo

Despite many complaints about Jax being hard to use, it has a crucial advantage over PyTorch: for distributed jobs, XLA is sufficiently good at auto-scheduling parallelism strategies, e.g., sharding, overlapping compute and comms. If PyTorch becomes good at that, it's checkmate.
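For context, a minimal JAX sketch (mine, not from the tweet) of what this auto-scheduling looks like in practice: you annotate arrays with a sharding over a named device mesh, and XLA's GSPMD partitioner decides how to split the computation and insert/overlap the collectives. The mesh axis name below is arbitrary.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# 1D mesh over all local devices (CPU/GPU/TPU); the axis name "data" is arbitrary.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

# Shard a batch of activations along the "data" axis; keep weights replicated.
batch = 2 * len(jax.devices())  # batch size divisible by the device count
x = jax.device_put(jnp.ones((batch, 1024)), NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((1024, 1024)), NamedSharding(mesh, P(None, None)))


@jax.jit
def layer(x, w):
    # XLA/GSPMD propagates the input shardings through the computation and
    # inserts any needed collectives, overlapping compute and communication.
    return jnp.dot(x, w)


y = layer(x, w)
print(y.shape, y.sharding)
```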

Jacob Austin (@jacobaustin132) 's Twitter Profile Photo

Making LLMs run efficiently can feel scary, but scaling isn’t magic, it’s math! We wanted to demystify the “systems view” of LLMs and wrote a little textbook called “How To Scale Your Model” which we’re releasing today. 1/n
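As a flavor of the "it's math" view (my example, not from the book), the standard back-of-the-envelope estimate for dense-transformer training compute is roughly 6 × parameters × tokens FLOPs; the numbers below are made up for illustration.

```python
# Rule of thumb: training FLOPs ≈ 6 * N_params * N_tokens for a dense
# transformer (~2 FLOPs/param/token forward, ~4 backward).
n_params = 70e9   # hypothetical 70B-parameter model
n_tokens = 2e12   # hypothetical 2T training tokens
flops = 6 * n_params * n_tokens

# With an assumed sustained 400 TFLOP/s per chip, the chip-days required:
chip_flops_per_s = 400e12
chip_days = flops / chip_flops_per_s / 86400
print(f"{flops:.3e} FLOPs, ~{chip_days:,.0f} chip-days")
```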

rdyro (@rdyro128523) 's Twitter Profile Photo

Deepseek R1 inference in pure JAX! Currently on TPU, with GPU and distilled models in-progress. Features MLA-style attention, expert/tensor parallelism & int8 quantization. Contributions welcome!
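Not from the repo itself, but a tiny JAX sketch of the kind of int8 weight quantization mentioned: store int8 weights plus a per-channel float scale, and rescale after the matmul at inference time. Function names here are illustrative.

```python
import jax
import jax.numpy as jnp


def quantize_int8(w):
    """Symmetric per-output-channel int8 quantization: int8 weights + float scale."""
    scale = jnp.max(jnp.abs(w), axis=0, keepdims=True) / 127.0
    q = jnp.clip(jnp.round(w / scale), -127, 127).astype(jnp.int8)
    return q, scale


def int8_matmul(x, q, scale):
    """Matmul against int8 weights; rescale and cast back to the activation dtype."""
    return (jnp.dot(x, q.astype(x.dtype)) * scale).astype(x.dtype)


w = jax.random.normal(jax.random.PRNGKey(0), (512, 1024), dtype=jnp.float32)
x = jax.random.normal(jax.random.PRNGKey(1), (4, 512), dtype=jnp.bfloat16)

q, scale = quantize_int8(w)
y = int8_matmul(x, q, scale)
print(y.shape, y.dtype)  # (4, 1024) bfloat16
```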

Jack Rae (@jack_w_rae) 's Twitter Profile Photo

Today we are launching 2.5 Pro!

I think it's the best model in the world. State-of-the-art reasoning and great vibes (+39 ELO gap on lmsys!)

2.5 Pro improves in coding, stem, multimodal, instruction following, and lots more. 

Available in AI Studio & the Gemini App!
Jack Rae (@jack_w_rae) 's Twitter Profile Photo

2.5 Flash is out! You can now specify thinking budgets, or disable thinking entirely for lower latency. Strong code & reasoning capabilities, cost effective, fast. It's a great workhorse model for enterprise and developers, excited to hear your feedback.
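For reference, a minimal sketch of setting a thinking budget with the google-genai Python SDK (assuming the current API shape; check the official docs, as details may change):

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize the tradeoffs of tensor vs. expert parallelism.",
    config=types.GenerateContentConfig(
        # Cap the tokens spent on thinking; set to 0 to disable thinking
        # entirely for lower latency.
        thinking_config=types.ThinkingConfig(thinking_budget=1024),
    ),
)
print(response.text)
```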

Jacob Austin (@jacobaustin132) 's Twitter Profile Photo

Today we're putting out an update to the JAX TPU book, this time on GPUs. How do GPUs work, especially compared to TPUs? How are they networked? And how does this affect LLM training? 1/n
