Piotr Nawrot (@p_nawrot)'s Twitter Profile
Piotr Nawrot

@p_nawrot

PhD student in NLP @Edin_CDT_NLP | Previously intern @Nvidia @AIatMeta @cohere | 🥇🥈@ Polish Championships in Flunkyball

ID: 2694126977

Link: https://piotrnawrot.github.io | Joined: 10-07-2014 06:50:37

342 Tweets

6.6K Followers

257 Following

Piotr Nawrot (@p_nawrot)

Tomorrow at 6pm CET I'm giving a talk about our latest work on Sparse Attention at Cohere Labs. I plan to describe the field as it is now, discuss our evaluation results, and share insights about what I believe is the future of Sparse Attention. See you!

Piotr Nawrot (@p_nawrot)

We release a major improvement upon last year's Dynamic Memory Compression. DMS is better, easier, and faster to train. Future of Long Context is 1) KV Cache Compression + 2) Sparse Attention, both training-aware to avoid training-inference mismatch. Imho, DMS is SOTA for 1).
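For readers new to the idea: KV cache compression bounds the memory that attention keys and values occupy during decoding. As a deliberately naive illustration (plain recency-based eviction in NumPy, not DMS itself, whose compression is learned and training-aware), here is a minimal sketch; the `evict_kv` helper and the budget value are assumptions made for the example.

```python
import numpy as np

def evict_kv(keys, values, budget):
    """Toy KV cache 'compression': keep only the most recent `budget` entries.
    Real methods (learned eviction, token merging, DMS) are far smarter; this
    only shows why a bounded cache caps decoding memory."""
    if keys.shape[0] > budget:
        keys, values = keys[-budget:], values[-budget:]
    return keys, values

def attend(query, keys, values):
    """Single-head attention over whatever survives in the cache."""
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

d, budget = 64, 128
rng = np.random.default_rng(0)
keys, values = np.empty((0, d)), np.empty((0, d))
for step in range(1000):                            # toy decoding loop
    k, v, q = rng.normal(size=(3, d))
    keys, values = np.vstack([keys, k]), np.vstack([values, v])
    keys, values = evict_kv(keys, values, budget)   # memory stays O(budget)
    out = attend(q, keys, values)
print(keys.shape)  # (128, 64): the cache never exceeds the budget
```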

Marktechpost AI Research News ⚡ (@marktechpost)

NVIDIA Researchers Introduce Dynamic Memory Sparsification (DMS) for 8× KV Cache Compression in Transformer LLMs

As the demand for reasoning-heavy tasks grows, large language models (LLMs) are increasingly expected to generate longer sequences or parallel chains of reasoning.

Edoardo Ponti (@pontiedoardo)

Last week marked the end of my stay as a visiting professor at NVIDIA. During my time there, I became passionate about the idea of LLMs modulating sequence length to create faster, broader-horizon LLM architectures. Let me retrace these steps 🧵

Luca Perić (@lp_peric)

The Bitter Lesson is coming for Tokenization. The Byte Latent Transformer (BLT) showed the possibility of finding additional scaling laws related to removing tokenization, but the topic seemed to get little proper coverage...

Ori Press (@ori_press)

Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️
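To make the evaluation concrete: a benchmark like this times a candidate solver against a reference implementation on the same task and only credits a speedup if the output is still correct. The harness below is a toy stand-in (zlib standing in for the gzip-style compression task; the solver names, timing, and scoring are assumptions, not AlgoTune's actual interface).

```python
import time
import zlib

def reference_solver(data: bytes) -> bytes:
    """Baseline for the (hypothetical) compression task: zlib at level 9."""
    return zlib.compress(data, level=9)

def candidate_solver(data: bytes) -> bytes:
    """A 'submitted' solver that trades compression ratio for speed."""
    return zlib.compress(data, level=1)

def best_time(fn, data, repeats=5):
    """Best-of-N wall-clock time, to reduce timing noise."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        times.append(time.perf_counter() - start)
    return min(times)

data = bytes(range(256)) * 4096  # ~1 MB of mildly compressible input
# Correctness gate first: a fast but wrong solver scores nothing.
assert zlib.decompress(candidate_solver(data)) == data
speedup = best_time(reference_solver, data) / best_time(candidate_solver, data)
print(f"speedup over reference: {speedup:.1f}x")
```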

Sukjun (June) Hwang (@sukjun_hwang)

Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
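H-Net learns its chunk boundaries end to end inside the network; as a rough hand-rolled stand-in for the idea of data-driven boundaries (closer to the entropy-based tokenisation mentioned a few tweets below than to H-Net itself), the sketch below starts a new byte chunk wherever a toy bigram model is surprised. The bigram statistics and the threshold are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

def surprisals(text: bytes):
    """Per-byte surprisal (-log2 p) under a bigram model fit on the text itself."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    scores = [0.0]  # the first byte has no context; treat it as unsurprising
    for prev, nxt in zip(text, text[1:]):
        p = counts[prev][nxt] / sum(counts[prev].values())
        scores.append(-math.log2(p))
    return scores

def dynamic_chunks(text: bytes, threshold: float = 1.5):
    """Start a new chunk wherever the next byte is surprising, so frequent
    patterns end up merged into longer chunks."""
    chunks, start = [], 0
    for i, s in enumerate(surprisals(text)):
        if s > threshold and i > start:
            chunks.append(text[start:i])
            start = i
    chunks.append(text[start:])
    return chunks

print(dynamic_chunks(b"the cat sat on the mat and the cat sat on the hat"))
```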

Albert Gu (@_albertgu)

Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.

Edoardo Ponti (@pontiedoardo)

Thanks for acknowledging Dynamic Token Pooling as a predecessor to H-Net, Albert Gu! We had some decent ideas in that paper (e2e and entropy-based tokenisation), but it surprises me that it took 2 years (an eternity in NLP) to find the right recipe and scale better than BPE

Edoardo Ponti (@pontiedoardo)

If you are at the ICML Conference, make sure to attend Adrian Lancucki’s invited talk on our inference-time *hyper*-scaling paper (and more!) at the tokenization workshop this Friday: tokenization-workshop.github.io/schedule/

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025)

The TokShop schedule is now live! Join us at #ICML2025 for invited talks, poster sessions, and a panel on the future of tokenization. tokenization-workshop.github.io/schedule #Tokenization #LLM #NLProc

Simone Scardapane (@s_scardapane)

*The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs* by Piotr Nawrot, Edoardo Ponti, Kelly Marchisio (St. Denis), and Sebastian Ruder. They study sparse attention techniques at scale, comparing to small dense models at the same compute budget. arxiv.org/abs/2504.17768
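For a concrete picture of what a sparse attention technique looks like: the minimal NumPy sketch below restricts each query to a local sliding window of keys, one of the simplest sparse patterns such comparisons include. The window size, shapes, and mask construction are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    scores = np.where(mask, q @ k.T / np.sqrt(q.shape[-1]), -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n, d, window = 512, 64, 32
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, n, d))

causal = np.tril(np.ones((n, n), dtype=bool))
# Sliding-window sparsity: each query attends only to its last `window` keys.
local = causal & (np.arange(n)[None, :] > np.arange(n)[:, None] - window)

dense, sparse = attention(q, k, v, causal), attention(q, k, v, local)
print("fraction of score entries kept:", local.sum() / causal.sum())
print("max output difference vs dense:", np.abs(dense - sparse).max())
```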
