Ted Zadouri (@tedzadouri)'s Twitter Profile
Ted Zadouri

@tedzadouri

PhD Student @PrincetonCS || Previously: @CohereForAI @UCLA

ID: 1273146181951139840

Joined: 17-06-2020 06:51:50

157 Tweets

355 Followers

224 Following

𝚐𝔪𝟾𝚡𝚡𝟾 (@gm8xx8):

Hardware-Efficient Attention for Fast Decoding

Princeton optimizes decoding by maximizing arithmetic intensity (FLOPs/byte) for better memory–compute efficiency:

- GTA (Grouped-Tied Attention)
Ties key/value states + partial RoPE → 2× arithmetic intensity vs. GQA, ½ KV cache, …
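For context on the arithmetic-intensity claim above, here is a rough back-of-the-envelope accounting. It is not taken from the paper: the symbols h_q (query heads), h_kv (KV heads), n (context length), d (head dimension), and b (bytes per element) are my own shorthand, and the partial-RoPE split and the write of the new token's KV are ignored. Per decoded token:

\[
\text{arithmetic intensity} \;=\; \frac{\text{FLOPs}}{\text{bytes moved}},
\qquad
\text{FLOPs per decoded token} \;\approx\; \underbrace{2\,h_q\,n\,d}_{QK^\top} \;+\; \underbrace{2\,h_q\,n\,d}_{PV} \;=\; 4\,h_q\,n\,d,
\]
\[
\text{bytes}_{\mathrm{GQA}} \;\approx\; 2\,h_{kv}\,n\,d\,b \ \ (\text{separate K and V caches}),
\qquad
\text{bytes}_{\mathrm{GTA}} \;\approx\; h_{kv}\,n\,d\,b \ \ (\text{one tied cache}),
\]
\[
\Rightarrow\quad
\frac{\mathrm{AI}_{\mathrm{GTA}}}{\mathrm{AI}_{\mathrm{GQA}}} \;\approx\; 2,
\qquad
\frac{\text{cache}_{\mathrm{GTA}}}{\text{cache}_{\mathrm{GQA}}} \;\approx\; \tfrac{1}{2},
\]

which is consistent with the "2× arithmetic intensity, ½ KV cache" figures quoted above.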
Tanishq Mathew Abraham, Ph.D. (@iscienceluvr):

Hardware-Efficient Attention for Fast Decoding "We first propose Grouped-Tied Attention (GTA), a simple variant that combines and reuses key and value states, reducing memory transfers without compromising model quality. We then introduce Grouped Latent Attention (GLA), a

Hardware-Efficient Attention for Fast Decoding

"We first propose Grouped-Tied Attention (GTA), a simple variant that  combines and reuses key and value states, reducing memory transfers  without compromising model quality. We then introduce Grouped Latent  Attention (GLA), a
Tri Dao (@tri_dao):

We've been thinking about what the "ideal" architecture should look like in the era where inference is driving AI progress. GTA & GLA are steps in this direction: attention variants tailored for inference, with high arithmetic intensity (make GPUs go brr even during decoding), easy to …

Hub (@hubstrauss):

Great project where we rethink attention for inference: Grouped-Tied Attn (GTA) ties the KV, and Grouped Latent Attn (GLA) shards low-rank latents across GPUs. Results: high arithmetic intensity, high-quality models, and a parallel-friendly design. Kudos to the team!!

Wentao Guo (@wentaoguo7):

🦆🚀QuACK🦆🚀: new speed-of-light (SOL) mem-bound kernel library without a single line of CUDA C++, all straight in Python thanks to CuTe-DSL. On H100 with 3TB/s, it performs 33%-50% faster than highly optimized libraries like PyTorch's torch.compile and Liger. 🤯

With Ted Zadouri and Tri Dao
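As a rough illustration of what "speed of light" means for a mem-bound kernel, the sketch below times a memory-bound op in plain PyTorch and compares the achieved bandwidth against the H100's peak HBM bandwidth (the tweet rounds it to 3TB/s; the SXM spec is about 3.35TB/s). This is not QuACK's API, just a generic measurement harness; the helper name achieved_bandwidth_gbs is made up for the example.

# Generic harness to see how close a memory-bound op gets to peak HBM
# bandwidth ("speed of light"). Plain PyTorch, not QuACK's API; the
# ~3.35 TB/s figure is the H100 SXM spec (the tweet rounds to 3 TB/s).
import torch

def achieved_bandwidth_gbs(fn, bytes_moved, iters=100):
    for _ in range(10):                       # warm-up
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters      # average milliseconds per call
    return bytes_moved / (ms * 1e-3) / 1e9    # GB/s

x = torch.randn(32768, 8192, device="cuda", dtype=torch.bfloat16)
bytes_moved = 2 * x.numel() * x.element_size()        # read input + write output
bw = achieved_bandwidth_gbs(lambda: torch.softmax(x, dim=-1), bytes_moved)
print(f"{bw:.0f} GB/s achieved, {bw / 3350 * 100:.0f}% of ~3.35 TB/s peak")

A kernel is at "speed of light" when the achieved number approaches the hardware peak, since a memory-bound op cannot go faster than the bytes it must move.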
Tri Dao (@tri_dao):

Getting mem-bound kernels to speed-of-light isn't a dark art, it's just about getting a couple of details right. We wrote a tutorial on how to do this, with code you can directly use. Thanks to the new CuTe-DSL, we can hit speed-of-light without a single line of CUDA C++.

Mayank Mishra (@mayankmish98):

🦆QuACK: blazing fast CuTe-DSL GPU kernels with 3TB/s goodness! Optimizing your kernels as much as possible is important... unless you are okay with leaving throughput on the table. Check out this work from vlaw, Ted Zadouri and Tri Dao

Ted Zadouri (@tedzadouri):

CuTe DSL feels almost unreal: minimal Python code hits peak memory throughput on H100, as we show in QuACK. Can't wait for the addition of kernels optimized for Blackwell in QuACK 🦆

Princeton Computer Science (@princetoncs):

Congrats to Parastoo Abtahi, Tri Dao and Alex Lombardi on being named 2025 Google Research Scholars. 🎉

The @googleresearch scholars program funds world-class research conducted by early-career professors. 

bit.ly/4kvpvFx