Jordan Juravsky (@jordanjuravsky)'s Twitter Profile
Jordan Juravsky

@jordanjuravsky

AI PhD Student at Stanford, proud former goose at UWaterloo.

ID: 1421558207739142150

Website: https://jordanjuravsky.com · Joined: 31-07-2021 19:48:16

80 Tweets

601 Followers

210 Following

Andrej Karpathy (@karpathy)

So so so cool. Llama 1B batch one inference in one single CUDA kernel, deleting synchronization boundaries imposed by breaking the computation into a series of kernels called in sequence. The *optimal* orchestration of compute and memory is only achievable in this way.

Owen Dugan (@owendugan)

A megakernel for Llama!🦙 We built a single kernel for the entire Llama 1B forward pass, enabling >1000 tokens/s on a single H100 and almost 1500 tokens/s on a single B200! Check it out!
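
The speedup comes from deleting per-kernel launch and synchronization boundaries, which dominate at batch size 1 where each individual kernel does very little work. A rough PyTorch timing sketch of that effect (my own toy benchmark with made-up sizes, not the authors' megakernel):

```python
# Toy timing sketch (my own benchmark, not the authors'): at batch size 1 each layer's
# kernel does so little work that launch boundaries between kernels dominate.
import torch

assert torch.cuda.is_available()
device, dtype = "cuda", torch.float16

hidden, num_layers = 2048, 16              # made-up sizes, stand-in for a layer stack
x = torch.randn(1, hidden, device=device, dtype=dtype)
weights = [torch.randn(hidden, hidden, device=device, dtype=dtype) for _ in range(num_layers)]
stacked = torch.stack(weights)             # (num_layers, hidden, hidden)

def many_kernels():
    # One matmul kernel launch per layer, as an engine built from separate ops would do.
    h = x
    for w in weights:
        h = h @ w
    return h

def one_kernel():
    # One batched matmul: same total FLOPs, a single launch. (Not mathematically
    # equivalent to the chained loop; it only isolates the launch-boundary overhead.)
    return torch.matmul(x, stacked)

def time_ms(fn, iters=200):
    for _ in range(20):                     # warmup
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

print(f"{num_layers} separate kernels: {time_ms(many_kernels):.3f} ms")
print(f"1 batched kernel:    {time_ms(one_kernel):.3f} ms")
```

The actual megakernel goes much further: the entire Llama 1B forward pass lives inside one CUDA kernel, so compute and memory movement can be orchestrated across what would otherwise be kernel boundaries.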

Simon Guo 🦝 (@simonguozirui)

I LOVE 🫶 using Tokasaurus 🦖🔥 for my research over the last few months! Jordan Juravsky and team have made it so easy to use and super high throughput across a variety of models and hardware configurations, making these test-time / throughput-heavy experiments even possible

Azalia Mirhoseini (@azaliamirh)

In the test time scaling era, we all would love a higher throughput serving engine! Introducing Tokasaurus, a LLM inference engine for high-throughput workloads with large and small models! Led by Jordan Juravsky, in collaboration with hazyresearch and an amazing team!

Sabri Eyuboglu (@eyuboglusabri)

When we put lots of text (eg a code repo) into LLM context, cost soars b/c of the KV cache’s size. What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory on avg 39x

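Mechanically, "training a smaller KV cache offline" can be pictured with the heavily simplified PyTorch sketch below (my own illustration of the general idea, not the paper's self-study recipe or code; it assumes a Llama-style Hugging Face model that accepts legacy tuple-format past_key_values):

```python
# Heavily simplified sketch (my illustration, not the paper's code): treat a small,
# fixed-size KV cache as trainable tensors and fit it offline so the frozen LM,
# conditioned only on that cache, still predicts text about the document.
import torch
import torch.nn as nn

class TrainableKVCache(nn.Module):
    """A fixed-size KV cache stored as parameters rather than computed from the full
    document, so it can be much shorter than the document itself."""
    def __init__(self, num_layers, num_kv_heads, cache_len, head_dim):
        super().__init__()
        # Per layer: (key, value), each (batch=1, num_kv_heads, cache_len, head_dim)
        self.kv = nn.Parameter(0.02 * torch.randn(num_layers, 2, 1, num_kv_heads, cache_len, head_dim))

    def as_past_key_values(self):
        return [(layer[0], layer[1]) for layer in self.kv]

def train_cache(model, tokenizer, training_texts, cache_len=512, steps=200, lr=1e-2):
    """Fit the cache so the *frozen* model, given only the cache, explains `training_texts`.
    In the paper this data comes from a self-study procedure (the model quizzing itself
    about the long document); here it is simply passed in as strings."""
    cfg = model.config
    cache = TrainableKVCache(cfg.num_hidden_layers, cfg.num_key_value_heads, cache_len,
                             cfg.hidden_size // cfg.num_attention_heads)
    opt = torch.optim.Adam(cache.parameters(), lr=lr)
    model.requires_grad_(False)
    for step in range(steps):
        text = training_texts[step % len(training_texts)]
        ids = tokenizer(text, return_tensors="pt").input_ids
        out = model(input_ids=ids, past_key_values=cache.as_past_key_values(), labels=ids)
        opt.zero_grad()
        out.loss.backward()
        opt.step()
    return cache  # ~cache_len positions of KV instead of the whole document's KV
```
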
Hermann (@kumbonghermann)

Excited to be presenting our new work–HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation– at #CVPR2025 this week. VAR (Visual Autoregressive Modelling) introduced a very nice way to formulate autoregressive image generation as a next-scale prediction task (from

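For readers unfamiliar with VAR, here is a tiny self-contained sketch of that next-scale prediction loop (my paraphrase of the formulation with a random stand-in model, not HMAR's architecture or code):

```python
import torch

def dummy_next_scale_model(context, target_scale, vocab=4096):
    # Stand-in for the real transformer: in VAR-style models the logits for the next
    # scale are conditioned on all coarser-scale token maps generated so far.
    return torch.randn(1, target_scale * target_scale, vocab)

def generate_coarse_to_fine(model=dummy_next_scale_model, scales=(1, 2, 4, 8, 16)):
    """Autoregression over scales: each step predicts an entire s x s token map at
    once, rather than one token at a time."""
    token_maps = []
    for s in scales:
        context = (torch.cat([t.flatten(1) for t in token_maps], dim=1)
                   if token_maps else None)
        logits = model(context, s)                                        # (1, s*s, vocab)
        tokens = torch.distributions.Categorical(logits=logits).sample()  # (1, s*s)
        token_maps.append(tokens.view(1, s, s))
    return token_maps  # coarse-to-fine token maps; a VQ decoder would turn these into pixels

print([tuple(t.shape) for t in generate_coarse_to_fine()])
```
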
Rylan Schaeffer (@rylanschaeffer)

A bit late to the party, but our paper on predictable inference-time / test-time scaling was accepted to #icml2025 🎉🎉🎉 TLDR: Best of N was shown to exhibit power (polynomial) law scaling (left), but maths suggest one should expect exponential scaling (center). We show how to

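The exponential-scaling intuition is easy to see with a toy calculation (my own numbers, not the paper's experiments), assuming i.i.d. samples, a fixed per-attempt success probability p, and an oracle that recognizes a correct sample when one appears:

```python
# Toy numbers (not the paper's data): Best-of-N fails only if all N samples fail, so
# the failure rate is (1 - p)**N = exp(N * log(1 - p)) -- exponential in N, not a
# power law N**(-a).
import math

p = 0.05  # assumed per-sample success probability on a hard problem
for n in (1, 10, 100, 1000):
    failure = (1 - p) ** n
    print(f"N={n:5d}  P(all samples fail) = {failure:.3e}  (= exp({n * math.log(1 - p):.1f}))")
```
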
Jon Saad-Falcon (@jonsaadfalcon)

How can we close the generation-verification gap when LLMs produce correct answers but fail to select them? 🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models + LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning

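A minimal sketch of the weak-verifier-ensemble idea (my own illustration with assumed interfaces, not Weaver's actual API): score every candidate answer with several imperfect verifiers and return the candidate with the highest weighted score.

```python
# Minimal sketch of combining weak verifiers (my illustration, not Weaver's API).
from typing import Callable, Sequence

Verifier = Callable[[str, str], float]  # (question, candidate answer) -> score in [0, 1]

def select_answer(question: str,
                  candidates: Sequence[str],
                  verifiers: Sequence[Verifier],
                  weights: Sequence[float]) -> str:
    """Weighted combination of weak verifier scores; the weights could be fit on a
    small labeled set according to how predictive each verifier is of correctness."""
    def combined(cand: str) -> float:
        return sum(w * v(question, cand) for v, w in zip(verifiers, weights))
    return max(candidates, key=combined)

# Toy usage with stand-in verifiers (real ones would be reward models / LM judges):
length_prior = lambda q, c: min(len(c) / 100.0, 1.0)
mentions_units = lambda q, c: 1.0 if "km" in c else 0.0
print(select_answer("How far away is the Moon?",
                    ["Pretty far.", "About 384,400 km on average."],
                    verifiers=[length_prior, mentions_units],
                    weights=[0.3, 0.7]))
```
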
Jerry Liu (@jerrywliu)

1/10 ML can solve PDEs – but precision🔬is still a challenge. Towards high-precision methods for scientific problems, we introduce BWLer 🎳, a new architecture for physics-informed learning achieving (near-)machine-precision (up to 10⁻¹² RMSE) on benchmark PDEs. 🧵How it works:
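
For context on what "physics-informed learning" refers to here, the sketch below shows a standard PINN-style residual loss on a toy ODE (a generic example, not BWLer's architecture); the tweet's point is that pushing this kind of setup to near-machine precision is hard.

```python
# Generic PINN-style residual loss on a toy ODE (not BWLer): fit u(x) so that
# u'' + u = 0 with u(0) = 0, u'(0) = 1, whose exact solution is sin(x).
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    x = (2 * torch.pi) * torch.rand(256, 1)
    x.requires_grad_(True)
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    pde_residual = (d2u + u).pow(2).mean()                 # enforce u'' + u = 0

    x0 = torch.zeros(1, 1, requires_grad=True)
    u0 = net(x0)
    du0 = torch.autograd.grad(u0.sum(), x0, create_graph=True)[0]
    ic_loss = u0.pow(2).mean() + (du0 - 1).pow(2).mean()   # enforce u(0)=0, u'(0)=1

    loss = pde_residual + ic_loss
    opt.zero_grad()
    loss.backward()
    opt.step()

# Plain float32 training like this typically plateaus far above the ~1e-12 RMSE
# regime mentioned in the tweet.
print(f"final loss: {loss.item():.3e}")
```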