Bradley Brown (@brad19brown)'s Twitter Profile
Bradley Brown

@brad19brown

Bit rearranger 👨‍💻 | Incoming CS PhD at Stanford, CS Master's Student at the University of Oxford

ID: 3064176064

Link: http://www.bradbrown.ca

Joined: 26-02-2015 23:38:15

57 Tweets

299 Followers

367 Following

Dan Biderman (@dan_biderman)'s Twitter Profile Photo

How can we use small LLMs to shift more AI workloads onto our laptops and phones? In our paper and open-source code, we pair on-device LLMs (ollama) with frontier LLMs in the cloud (@openai, @together) to solve token-intensive workloads on your 💻 at 17.5% of the cloud cost.

Avanika Narayan (@avanika15)'s Twitter Profile Photo

we shipp’d 👭 on-device lms and frontier cloud lms. and…they were a match ☺️. 98% accuracy at just 17.5% of the cloud API costs. beyond excited to drop minions: where local lms meet cloud lms 😊 joint work w/ Sabri Eyuboglu & Dan Biderman at @hazyresearch. ty Together AI,

Sabri Eyuboglu (@eyuboglusabri)'s Twitter Profile Photo

All these on-device models are coming out (e.g. llama 3.2). But how can we actually make them useful for hard reasoning workloads (beyond iMessage summarization)? Our idea: give the on-device models your long context and let them communicate with frontier models in the cloud.
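
A minimal sketch of the pairing idea in these three threads, assuming a hypothetical split where a small local model skims the long context and a cloud model only sees its distilled notes. `local_generate` and `cloud_generate` are placeholder names, not the released Minions API; swap in a real ollama client and a cloud SDK to try it.

```python
def local_generate(prompt: str) -> str:
    """Placeholder for an on-device LLM call (e.g. an ollama / llama.cpp client)."""
    raise NotImplementedError


def cloud_generate(prompt: str) -> str:
    """Placeholder for a frontier cloud LLM call."""
    raise NotImplementedError


def answer_over_long_context(question: str, document: str, chunk_size: int = 4000) -> str:
    # 1) The small on-device model reads the long document chunk by chunk and
    #    extracts only the passages relevant to the question (the token-heavy part).
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    notes = []
    for chunk in chunks:
        note = local_generate(
            f"Question: {question}\n\n"
            f"Copy any sentences from the text below that help answer the "
            f"question, or reply NONE.\n\n{chunk}"
        )
        if note.strip() != "NONE":
            notes.append(note)

    # 2) The cloud model never sees the full document, only the short notes,
    #    which is where the cost savings come from.
    return cloud_generate(
        f"Question: {question}\n\n"
        f"Notes extracted by a local assistant:\n" + "\n".join(notes) +
        "\n\nAnswer the question using these notes."
    )
```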

Simon Guo 🦝 (@simonguozirui)'s Twitter Profile Photo

LLMs for GPU kernel🌽generation have been getting Pop🍿ular since our preview last Dec; excited to announce 📢 our full paper 📃 for KernelBench!

Turns out KernelBench is quite challenging 🧠 — frontier models outperform the PyTorch Eager baseline <20% of the time.

More 🧵👇
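
For readers unfamiliar with the setup, a generated kernel only counts as a win if it matches the eager PyTorch reference numerically and runs faster. The harness below is an illustrative sketch of that check, not the KernelBench evaluation code; `candidate_fn` stands in for a model-generated kernel wrapped as a Python callable, and a CUDA GPU is assumed.

```python
import torch


def eager_reference(x: torch.Tensor) -> torch.Tensor:
    # Example task: a "square then softmax" op written in plain eager PyTorch.
    return torch.softmax(x * x, dim=-1)


def time_fn(fn, x, iters: int = 50) -> float:
    # Simple CUDA timing with warmup; returns milliseconds per call.
    for _ in range(5):
        fn(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters


def evaluate(candidate_fn, shape=(4096, 4096)) -> dict:
    x = torch.randn(*shape, device="cuda")
    correct = torch.allclose(candidate_fn(x), eager_reference(x), atol=1e-4, rtol=1e-4)
    speedup = time_fn(eager_reference, x) / time_fn(candidate_fn, x)
    return {"correct": correct, "speedup": speedup, "beats_eager": correct and speedup > 1.0}
```
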
Benjamin F Spector (@bfspector)'s Twitter Profile Photo

(1/7) Inspired by DeepSeek's FlashMLA, we're releasing ThunderMLA—a fused megakernel optimized for variable-prompt decoding! ⚡️🐱ThunderMLA is up to 35% faster than FlashMLA and just 400 LoC.

Blog: bit.ly/4kubAAK
With Aaryan Singhal, Dan Fu, and @hazyresearch!
Benjamin F Spector (@bfspector)'s Twitter Profile Photo

(1/6) Joyously announcing ThunderKittens with real support on NVIDIA Blackwell! We've released BF16/FP8 GEMM and attention fwd+bwd kernels, up to 2x faster than cuBLAS GEMMs on H100. Blog: bit.ly/41tuT4Q With Dan Fu, Aaryan Singhal, and @hazyresearch!

hazyresearch (@hazyresearch)'s Twitter Profile Photo

The Great American AI Race. I wrote something about how we need a holistic AI effort from academia, industry, and the US government to have the best shot at a freer, better educated, and healthier world in AI. I’m a mega bull on the US and open source AI. Maybe we’re cooking

Jordan Juravsky (@jordanjuravsky)'s Twitter Profile Photo

When studying repeated sampling in Large Language Monkeys, we found that the relationship between log(pass@k) and the number of samples often follows a power law. But *why* do we see this scaling law? At first glance, this is surprising, since for a single problem pass@k and k
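
For context, pass@k here is the standard unbiased estimator from n samples with c correct, and the power-law claim can be eyeballed by regressing log(-log pass@k) on log k. The snippet below is an illustrative recomputation under that common parameterization, not the paper's code; the example success counts are made up.

```python
import numpy as np
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem with n samples, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def coverage_curve(num_correct: list[int], n: int, ks: list[int]) -> np.ndarray:
    """Mean pass@k over problems for each k, i.e. the fraction of problems solved."""
    return np.array([np.mean([pass_at_k(n, c, k) for c in num_correct]) for k in ks])


def fit_power_law(ks: list[int], coverage: np.ndarray):
    """Fit -log(coverage) ~ a * k**(-b) by linear regression in log-log space."""
    y = np.log(-np.log(np.clip(coverage, 1e-12, 1 - 1e-12)))
    slope, intercept = np.polyfit(np.log(ks), y, 1)
    return np.exp(intercept), -slope  # (a, b)


# Example with made-up per-problem success counts out of n=100 samples:
counts = [0, 1, 1, 2, 5, 10, 40, 90]
ks = [1, 2, 4, 8, 16, 32, 64]
cov = coverage_curve(counts, n=100, ks=ks)
a, b = fit_power_law(ks, cov)
```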

Azalia Mirhoseini (@azaliamirh)'s Twitter Profile Photo

In Large Language Monkeys, we showed the scaling laws of inference-time compute with repeated sampling--the power law relationship between the number of repeated attempts and the fraction of problems solved!

The following amazing work theoretically proves the necessary and
Azalia Mirhoseini (@azaliamirh)'s Twitter Profile Photo

Excited to release SWiRL: A synthetic data generation and multi-step RL approach for reasoning and tool use!

With SWiRL, the model’s capability generalizes to new tasks and tools. For example, a model trained to use a retrieval tool to solve multi-hop knowledge-intensive
Anna Goldie (@annadgoldie)'s Twitter Profile Photo

Excited to share our new paper on Step-Wise Reinforcement Learning (SWiRL), which uses reinforcement learning and synthetic trajectories to improve multi-step reasoning and tool use! (1/8)
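
A heavily simplified sketch of the step-wise flavor described in these threads: roll out multi-step trajectories of thought, tool call, and observation, score each intermediate step, and build training data at the step level rather than judging only the final answer. `model_step`, `run_tool`, and `judge_step` are hypothetical placeholders, and the real SWiRL objective differs, so read this as an illustration only.

```python
from dataclasses import dataclass


@dataclass
class Step:
    prompt: str    # conversation so far
    action: str    # model output: a tool call or a final answer
    reward: float  # step-level score from a judge / reward model


def model_step(prompt: str) -> str:
    raise NotImplementedError  # the policy model being trained


def run_tool(action: str) -> str:
    raise NotImplementedError  # e.g. a retrieval or calculator call


def judge_step(prompt: str, action: str) -> float:
    raise NotImplementedError  # step-level reward model / LM judge


def rollout(question: str, max_steps: int = 5) -> list[Step]:
    prompt, steps = question, []
    for _ in range(max_steps):
        action = model_step(prompt)
        steps.append(Step(prompt, action, judge_step(prompt, action)))
        if action.startswith("FINAL:"):  # the model chose to answer
            break
        prompt += f"\n{action}\n{run_tool(action)}"  # append the tool's observation
    return steps


def build_stepwise_dataset(questions: list[str], threshold: float = 0.5) -> list[Step]:
    # Credit is assigned per step: each well-scored intermediate step becomes
    # a training example, instead of only rewarding the final answer.
    data = []
    for q in questions:
        data += [s for s in rollout(q) if s.reward >= threshold]
    return data
```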

Benjamin F Spector (@bfspector)'s Twitter Profile Photo

(1/5) We’ve never enjoyed watching people chop Llamas into tiny pieces.

So, we’re excited to be releasing our Low-Latency-Llama Megakernel! We run the whole forward pass in a single kernel.

Megakernels are faster & more humane. Here’s how to treat your Llamas ethically:

(Joint
Jordan Juravsky (@jordanjuravsky)'s Twitter Profile Photo

We wrote a megakernel! Excited to share how we fused Llama-1B into a single kernel to reach SOTA latency. Check out our blog post and code below!
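
The kernels above are hand-written CUDA; the toy PyTorch comparison below only illustrates why fusing helps at batch size 1: an eager forward pass launches many small kernels whose launch overhead can rival the math itself, while a compiled/fused version issues far fewer launches. It assumes PyTorch 2.x and a CUDA GPU, and is unrelated to the released megakernel code; the numbers are only indicative.

```python
import torch

d = 2048
x = torch.randn(1, d, device="cuda", dtype=torch.float16)
w1 = torch.randn(d, d, device="cuda", dtype=torch.float16)
w2 = torch.randn(d, d, device="cuda", dtype=torch.float16)


def block(x):
    # a small transformer-ish fragment: norm, two matmuls, activation, residual
    h = torch.nn.functional.layer_norm(x, (d,))
    h = torch.nn.functional.silu(h @ w1) @ w2
    return x + h


fused_block = torch.compile(block)  # fuses the chain into fewer kernel launches


def bench(fn, iters=200):
    for _ in range(20):
        fn(x)  # warmup (also triggers compilation for the fused version)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call


print("eager :", bench(block), "ms")
print("fused :", bench(fused_block), "ms")
```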

Sabri Eyuboglu (@eyuboglusabri)'s Twitter Profile Photo

When we put lots of text (eg a code repo) into LLM context, cost soars b/c of the KV cache’s size.

What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory on avg 39x
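
A toy sketch of that idea: instead of paying for the full document's KV cache at inference time, learn a much smaller trainable prefix offline from synthetic Q&A about the document, and prepend only the prefix when serving. The snippet uses a plain soft-prompt LM loss on a small HuggingFace model as a stand-in; the actual recipe and training objective in the work above differ, so treat this purely as an illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                 # small stand-in model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.requires_grad_(False)         # only the prefix is trained

num_prefix_tokens = 64              # the "compressed cache" size knob
embed = model.get_input_embeddings()
prefix = torch.nn.Parameter(0.02 * torch.randn(1, num_prefix_tokens, embed.embedding_dim))
opt = torch.optim.Adam([prefix], lr=1e-3)


def train_step(qa_text: str) -> float:
    # qa_text is a synthetic question/answer pair generated *from the document*;
    # the document itself never appears at train or serve time, only the prefix.
    ids = tok(qa_text, return_tensors="pt").input_ids
    inputs = torch.cat([prefix, embed(ids)], dim=1)
    # ignore the loss on prefix positions, predict the Q&A tokens as usual
    labels = torch.cat(
        [torch.full((1, num_prefix_tokens), -100, dtype=torch.long), ids], dim=1
    )
    out = model(inputs_embeds=inputs, labels=labels)
    opt.zero_grad()
    out.loss.backward()
    opt.step()
    return out.loss.item()
```
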
Ryan Ehrlich (@ryansehrlich)'s Twitter Profile Photo

Giving LLMs very large amounts of context can be really useful, but it can also be slow and expensive. Could scaling inference-time compute help? In our latest work, we show that allowing models to spend test-time compute to “self-study” a large corpus can >20x decode

Jon Saad-Falcon (@jonsaadfalcon)'s Twitter Profile Photo

How can we close the generation-verification gap when LLMs produce correct answers but fail to select them? 
🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models + LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning
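
A bare-bones sketch of the selection problem being described: sample k candidate answers, score each with several imperfect verifiers, and return the candidate with the best combined score. The hard part Weaver actually addresses is how to weight the verifiers without labels; here the weights are simply given, and the verifier names in the usage comment are hypothetical.

```python
import numpy as np


def select_answer(candidates: list[str], verifiers: list, weights: list[float]) -> str:
    """
    candidates: k sampled answers to one problem
    verifiers:  callables mapping an answer string -> score in [0, 1]
    weights:    per-verifier weights (given here; learning them is the real work)
    """
    scores = np.array([[v(c) for v in verifiers] for c in candidates])  # shape (k, m)
    combined = scores @ np.array(weights)                               # shape (k,)
    return candidates[int(np.argmax(combined))]


# e.g. select_answer(samples, [reward_model_score, lm_judge_score], weights=[0.6, 0.4]),
# where reward_model_score and lm_judge_score are whatever verifiers you have on hand.
```
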
Jerry Liu (@jerrywliu)'s Twitter Profile Photo

1/10 ML can solve PDEs – but precision🔬is still a challenge. Towards high-precision methods for scientific problems, we introduce BWLer 🎳, a new architecture for physics-informed learning achieving (near-)machine-precision (up to 10⁻¹² RMSE) on benchmark PDEs. 🧵How it works:
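
For readers new to the area, "physics-informed learning" means training a network whose outputs satisfy a PDE's residual at sampled points. The snippet below is a generic physics-informed loss for a toy 1D Poisson problem, not the BWLer architecture; it only illustrates the objective such methods try to drive toward machine precision.

```python
import torch

# toy problem: u''(x) = f(x) on [0, 1] with u(0) = u(1) = 0, exact solution u = sin(pi x)
net = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
f = lambda x: -(torch.pi ** 2) * torch.sin(torch.pi * x)


def pinn_loss(n_points: int = 128) -> torch.Tensor:
    x = torch.rand(n_points, 1, requires_grad=True)
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    residual = ((d2u - f(x)) ** 2).mean()                    # PDE residual term
    boundary = net(torch.zeros(1, 1)) ** 2 + net(torch.ones(1, 1)) ** 2
    return residual + boundary.sum()                          # plus boundary penalty


opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(1000):
    opt.zero_grad()
    loss = pinn_loss()
    loss.backward()
    opt.step()
```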

Jacky Kwok (@jackyk02)'s Twitter Profile Photo

✨ Test-Time Scaling for Robotics ✨

Excited to release 🤖 RoboMonkey, which characterizes test-time scaling laws for Vision-Language-Action (VLA) models and introduces a framework that significantly improves the generalization and robustness of VLAs!

🧵(1 / N)

🌐 Website:
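
Schematically, the repeated-sampling recipe for policies looks like the best-of-k selection below: draw several candidate actions from the policy and execute the one a learned verifier scores highest. The `policy` and `verifier` callables are placeholders, not the RoboMonkey models.

```python
def best_of_k(policy, verifier, observation, k: int = 8):
    # repeated sampling: draw k candidate actions for the same observation
    candidates = [policy(observation) for _ in range(k)]
    # verification: execute the action the verifier likes best
    return max(candidates, key=lambda action: verifier(observation, action))
```
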
Azalia Mirhoseini (@azaliamirh)'s Twitter Profile Photo

Looking forward to attending ICML!

Here are some works on memory/long context, verification, kernel design, multi-model AI systems, and theoretical understanding of test-time scaling from my awesome students and collaborators!