Jordan Juravsky (@jordanjuravsky)'s Twitter Profile
Jordan Juravsky

@jordanjuravsky

AI PhD Student at Stanford, proud former goose at UWaterloo.

ID: 1421558207739142150

Website: https://jordanjuravsky.com · Joined: 31-07-2021 19:48:16

80 Tweets

601 Followers

210 Following

Andrej Karpathy (@karpathy)

So so so cool. Llama 1B batch one inference in one single CUDA kernel, deleting synchronization boundaries imposed by breaking the computation into a series of kernels called in sequence. The *optimal* orchestration of compute and memory is only achievable in this way.

Owen Dugan (@owendugan)

A megakernel for Llama!🦙 We built a single kernel for the entire Llama 1B forward pass, enabling >1000 tokens/s on a single H100 and almost 1500 tokens/s on a single B200! Check it out!
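
The speedup comes from deleting per-kernel launch and synchronization boundaries, which dominate at batch size 1 where each individual kernel does very little work. A rough PyTorch timing sketch of that effect (my own toy benchmark with made-up sizes, not the authors' megakernel):

```python
# Toy timing sketch (my own benchmark, not the authors'): at batch size 1 each layer's
# kernel does so little work that launch boundaries between kernels dominate.
import torch

assert torch.cuda.is_available()
device, dtype = "cuda", torch.float16

hidden, num_layers = 2048, 16              # made-up sizes, stand-in for a layer stack
x = torch.randn(1, hidden, device=device, dtype=dtype)
weights = [torch.randn(hidden, hidden, device=device, dtype=dtype) for _ in range(num_layers)]
stacked = torch.stack(weights)             # (num_layers, hidden, hidden)

def many_kernels():
    # One matmul kernel launch per layer, as an engine built from separate ops would do.
    h = x
    for w in weights:
        h = h @ w
    return h

def one_kernel():
    # One batched matmul: same total FLOPs, a single launch. (Not mathematically
    # equivalent to the chained loop; it only isolates the launch-boundary overhead.)
    return torch.matmul(x, stacked)

def time_ms(fn, iters=200):
    for _ in range(20):                     # warmup
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

print(f"{num_layers} separate kernels: {time_ms(many_kernels):.3f} ms")
print(f"1 batched kernel:    {time_ms(one_kernel):.3f} ms")
```

The actual megakernel goes much further: the entire Llama 1B forward pass lives inside one CUDA kernel, so compute and memory movement can be orchestrated across what would otherwise be kernel boundaries.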

Simon Guo 🦝 (@simonguozirui)

I LOVE 🫶 using Tokasaurus 🦖🔥 for my research over the last few months! Jordan Juravsky and team have made it so easy to use and super high throughput across a variety of models and hardware configurations, making these test-time / throughput-heavy experiments even possible

Azalia Mirhoseini (@azaliamirh)

In the test time scaling era, we all would love a higher throughput serving engine! Introducing Tokasaurus, a LLM inference engine for high-throughput workloads with large and small models! Led by Jordan Juravsky, in collaboration with hazyresearch and an amazing team!

Sabri Eyuboglu (@eyuboglusabri)

When we put lots of text (eg a code repo) into LLM context, cost soars b/c of the KV cache’s size. What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory on avg 39x

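Mechanically, "training a smaller KV cache offline" can be pictured with the heavily simplified PyTorch sketch below (my own illustration of the general idea, not the paper's self-study recipe or code; it assumes a Llama-style Hugging Face model that accepts legacy tuple-format past_key_values):

```python
# Heavily simplified sketch (my illustration, not the paper's code): treat a small,
# fixed-size KV cache as trainable tensors and fit it offline so the frozen LM,
# conditioned only on that cache, still predicts text about the document.
import torch
import torch.nn as nn

class TrainableKVCache(nn.Module):
    """A fixed-size KV cache stored as parameters rather than computed from the full
    document, so it can be much shorter than the document itself."""
    def __init__(self, num_layers, num_kv_heads, cache_len, head_dim):
        super().__init__()
        # Per layer: (key, value), each (batch=1, num_kv_heads, cache_len, head_dim)
        self.kv = nn.Parameter(0.02 * torch.randn(num_layers, 2, 1, num_kv_heads, cache_len, head_dim))

    def as_past_key_values(self):
        return [(layer[0], layer[1]) for layer in self.kv]

def train_cache(model, tokenizer, training_texts, cache_len=512, steps=200, lr=1e-2):
    """Fit the cache so the *frozen* model, given only the cache, explains `training_texts`.
    In the paper this data comes from a self-study procedure (the model quizzing itself
    about the long document); here it is simply passed in as strings."""
    cfg = model.config
    cache = TrainableKVCache(cfg.num_hidden_layers, cfg.num_key_value_heads, cache_len,
                             cfg.hidden_size // cfg.num_attention_heads)
    opt = torch.optim.Adam(cache.parameters(), lr=lr)
    model.requires_grad_(False)
    for step in range(steps):
        text = training_texts[step % len(training_texts)]
        ids = tokenizer(text, return_tensors="pt").input_ids
        out = model(input_ids=ids, past_key_values=cache.as_past_key_values(), labels=ids)
        opt.zero_grad()
        out.loss.backward()
        opt.step()
    return cache  # ~cache_len positions of KV instead of the whole document's KV
```
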
Hermann (@kumbonghermann)

Excited to be presenting our new work–HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation– at #CVPR2025 this week. VAR (Visual Autoregressive Modelling) introduced a very nice way to formulate autoregressive image generation as a next-scale prediction task (from

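For readers unfamiliar with VAR, here is a tiny self-contained sketch of that next-scale prediction loop (my paraphrase of the formulation with a random stand-in model, not HMAR's architecture or code):

```python
import torch

def dummy_next_scale_model(context, target_scale, vocab=4096):
    # Stand-in for the real transformer: in VAR-style models the logits for the next
    # scale are conditioned on all coarser-scale token maps generated so far.
    return torch.randn(1, target_scale * target_scale, vocab)

def generate_coarse_to_fine(model=dummy_next_scale_model, scales=(1, 2, 4, 8, 16)):
    """Autoregression over scales: each step predicts an entire s x s token map at
    once, rather than one token at a time."""
    token_maps = []
    for s in scales:
        context = (torch.cat([t.flatten(1) for t in token_maps], dim=1)
                   if token_maps else None)
        logits = model(context, s)                                        # (1, s*s, vocab)
        tokens = torch.distributions.Categorical(logits=logits).sample()  # (1, s*s)
        token_maps.append(tokens.view(1, s, s))
    return token_maps  # coarse-to-fine token maps; a VQ decoder would turn these into pixels

print([tuple(t.shape) for t in generate_coarse_to_fine()])
```
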
Rylan Schaeffer (@rylanschaeffer)

A bit late to the party, but our paper on predictable inference-time / test-time scaling was accepted to #icml2025 🎉🎉🎉 TLDR: Best of N was shown to exhibit power (polynomial) law scaling (left), but maths suggest one should expect exponential scaling (center). We show how to

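The exponential-scaling intuition is easy to see with a toy calculation (my own numbers, not the paper's experiments), assuming i.i.d. samples, a fixed per-attempt success probability p, and an oracle that recognizes a correct sample when one appears:

```python
# Toy numbers (not the paper's data): Best-of-N fails only if all N samples fail, so
# the failure rate is (1 - p)**N = exp(N * log(1 - p)) -- exponential in N, not a
# power law N**(-a).
import math

p = 0.05  # assumed per-sample success probability on a hard problem
for n in (1, 10, 100, 1000):
    failure = (1 - p) ** n
    print(f"N={n:5d}  P(all samples fail) = {failure:.3e}  (= exp({n * math.log(1 - p):.1f}))")
```
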
Jon Saad-Falcon (@jonsaadfalcon)

How can we close the generation-verification gap when LLMs produce correct answers but fail to select them? 🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models + LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning

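A minimal sketch of the weak-verifier-ensemble idea (my own illustration with assumed interfaces, not Weaver's actual API): score every candidate answer with several imperfect verifiers and return the candidate with the highest weighted score.

```python
# Minimal sketch of combining weak verifiers (my illustration, not Weaver's API).
from typing import Callable, Sequence

Verifier = Callable[[str, str], float]  # (question, candidate answer) -> score in [0, 1]

def select_answer(question: str,
                  candidates: Sequence[str],
                  verifiers: Sequence[Verifier],
                  weights: Sequence[float]) -> str:
    """Weighted combination of weak verifier scores; the weights could be fit on a
    small labeled set according to how predictive each verifier is of correctness."""
    def combined(cand: str) -> float:
        return sum(w * v(question, cand) for v, w in zip(verifiers, weights))
    return max(candidates, key=combined)

# Toy usage with stand-in verifiers (real ones would be reward models / LM judges):
length_prior = lambda q, c: min(len(c) / 100.0, 1.0)
mentions_units = lambda q, c: 1.0 if "km" in c else 0.0
print(select_answer("How far away is the Moon?",
                    ["Pretty far.", "About 384,400 km on average."],
                    verifiers=[length_prior, mentions_units],
                    weights=[0.3, 0.7]))
```
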
Jerry Liu (@jerrywliu)

1/10 ML can solve PDEs – but precision🔬is still a challenge. Towards high-precision methods for scientific problems, we introduce BWLer 🎳, a new architecture for physics-informed learning achieving (near-)machine-precision (up to 10⁻¹² RMSE) on benchmark PDEs. 🧵How it works:
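
For context on what "physics-informed learning" refers to here, the sketch below shows a standard PINN-style residual loss on a toy ODE (a generic example, not BWLer's architecture); the tweet's point is that pushing this kind of setup to near-machine precision is hard.

```python
# Generic PINN-style residual loss on a toy ODE (not BWLer): fit u(x) so that
# u'' + u = 0 with u(0) = 0, u'(0) = 1, whose exact solution is sin(x).
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    x = (2 * torch.pi) * torch.rand(256, 1)
    x.requires_grad_(True)
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    pde_residual = (d2u + u).pow(2).mean()                 # enforce u'' + u = 0

    x0 = torch.zeros(1, 1, requires_grad=True)
    u0 = net(x0)
    du0 = torch.autograd.grad(u0.sum(), x0, create_graph=True)[0]
    ic_loss = u0.pow(2).mean() + (du0 - 1).pow(2).mean()   # enforce u(0)=0, u'(0)=1

    loss = pde_residual + ic_loss
    opt.zero_grad()
    loss.backward()
    opt.step()

# Plain float32 training like this typically plateaus far above the ~1e-12 RMSE
# regime mentioned in the tweet.
print(f"final loss: {loss.item():.3e}")
```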