Piotr Nawrot (@p_nawrot)'s Twitter Profile
Piotr Nawrot

@p_nawrot

PhD student in NLP @Edin_CDT_NLP | Previously intern @Nvidia @AIatMeta @cohere | 🥇🥈@ Polish Championships in Flunkyball

ID: 2694126977

Link: https://piotrnawrot.github.io | Joined: 10-07-2014 06:50:37

342 Tweets

6.6K Followers

257 Following

Piotr Nawrot (@p_nawrot)

Tomorrow at 6pm CET I'm giving a talk about our latest work on Sparse Attention at Cohere Labs. I plan to describe the field as it is now, discuss our evaluation results, and share insights about what I believe is the future of Sparse Attention. See you!

Piotr Nawrot (@p_nawrot)

We release a major improvement upon last year's Dynamic Memory Compression. DMS is better, easier, and faster to train. Future of Long Context is 1) KV Cache Compression + 2) Sparse Attention, both training-aware to avoid training-inference mismatch. Imho, DMS is SOTA for 1).
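For readers new to the idea: KV cache compression bounds the memory that attention keys and values occupy during decoding. As a deliberately naive illustration (plain recency-based eviction in NumPy, not DMS itself, whose compression is learned and training-aware), here is a minimal sketch; the `evict_kv` helper and the budget value are assumptions made for the example.

```python
import numpy as np

def evict_kv(keys, values, budget):
    """Toy KV cache 'compression': keep only the most recent `budget` entries.
    Real methods (learned eviction, token merging, DMS) are far smarter; this
    only shows why a bounded cache caps decoding memory."""
    if keys.shape[0] > budget:
        keys, values = keys[-budget:], values[-budget:]
    return keys, values

def attend(query, keys, values):
    """Single-head attention over whatever survives in the cache."""
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

d, budget = 64, 128
rng = np.random.default_rng(0)
keys, values = np.empty((0, d)), np.empty((0, d))
for step in range(1000):                            # toy decoding loop
    k, v, q = rng.normal(size=(3, d))
    keys, values = np.vstack([keys, k]), np.vstack([values, v])
    keys, values = evict_kv(keys, values, budget)   # memory stays O(budget)
    out = attend(q, keys, values)
print(keys.shape)  # (128, 64): the cache never exceeds the budget
```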

Marktechpost AI Research News ⚡ (@marktechpost)

NVIDIA Researchers Introduce Dynamic Memory Sparsification (DMS) for 8× KV Cache Compression in Transformer LLMs

As the demand for reasoning-heavy tasks grows, large language models (LLMs) are increasingly expected to generate longer sequences or parallel chains of reasoning.

Edoardo Ponti (@pontiedoardo)

Last week marked the end of my stay as a visiting professor at NVIDIA. During my time there, I became passionate about the idea of LLMs modulating sequence length to create faster, broader-horizon LLM architectures. Let me retrace these steps 🧵

Luca Perić (@lp_peric)

The Bitter Lesson is coming for Tokenization. The Byte Latent Transformer (BLT) showed the possibility of finding additional scaling laws related to removing tokenization, but the topic seemed to get little proper coverage...

Ori Press (@ori_press)

Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️
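To make the evaluation concrete: a benchmark like this times a candidate solver against a reference implementation on the same task and only credits a speedup if the output is still correct. The harness below is a toy stand-in (zlib standing in for the gzip-style compression task; the solver names, timing, and scoring are assumptions, not AlgoTune's actual interface).

```python
import time
import zlib

def reference_solver(data: bytes) -> bytes:
    """Baseline for the (hypothetical) compression task: zlib at level 9."""
    return zlib.compress(data, level=9)

def candidate_solver(data: bytes) -> bytes:
    """A 'submitted' solver that trades compression ratio for speed."""
    return zlib.compress(data, level=1)

def best_time(fn, data, repeats=5):
    """Best-of-N wall-clock time, to reduce timing noise."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        times.append(time.perf_counter() - start)
    return min(times)

data = bytes(range(256)) * 4096  # ~1 MB of mildly compressible input
# Correctness gate first: a fast but wrong solver scores nothing.
assert zlib.decompress(candidate_solver(data)) == data
speedup = best_time(reference_solver, data) / best_time(candidate_solver, data)
print(f"speedup over reference: {speedup:.1f}x")
```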

Sukjun (June) Hwang (@sukjun_hwang)

Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
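H-Net learns its chunk boundaries end to end inside the network; as a rough hand-rolled stand-in for the idea of data-driven boundaries (closer to the entropy-based tokenisation mentioned a few tweets below than to H-Net itself), the sketch below starts a new byte chunk wherever a toy bigram model is surprised. The bigram statistics and the threshold are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

def surprisals(text: bytes):
    """Per-byte surprisal (-log2 p) under a bigram model fit on the text itself."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    scores = [0.0]  # the first byte has no context; treat it as unsurprising
    for prev, nxt in zip(text, text[1:]):
        p = counts[prev][nxt] / sum(counts[prev].values())
        scores.append(-math.log2(p))
    return scores

def dynamic_chunks(text: bytes, threshold: float = 1.5):
    """Start a new chunk wherever the next byte is surprising, so frequent
    patterns end up merged into longer chunks."""
    chunks, start = [], 0
    for i, s in enumerate(surprisals(text)):
        if s > threshold and i > start:
            chunks.append(text[start:i])
            start = i
    chunks.append(text[start:])
    return chunks

print(dynamic_chunks(b"the cat sat on the mat and the cat sat on the hat"))
```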

Albert Gu (@_albertgu)

Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.

Edoardo Ponti (@pontiedoardo)

Thanks for acknowledging Dynamic Token Pooling as a predecessor to H-Net, Albert Gu! We had some decent ideas in that paper (e2e and entropy-based tokenisation), but it surprises me that it took 2 years (an eternity in NLP) to find the right recipe and scale better than BPE

Edoardo Ponti (@pontiedoardo)

If you are at the ICML Conference, make sure to attend Adrian Lancucki’s invited talk on our inference-time *hyper*-scaling paper (and more!) at the tokenization workshop this Friday: tokenization-workshop.github.io/schedule/

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025)

The TokShop schedule is now live! Join us at #ICML2025 for invited talks, poster sessions, and a panel on the future of tokenization. tokenization-workshop.github.io/schedule #Tokenization #LLM #NLProc

Simone Scardapane (@s_scardapane)

*The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs* by Piotr Nawrot, Edoardo Ponti, Kelly Marchisio (St. Denis), and Sebastian Ruder. They study sparse attention techniques at scale, comparing to small dense models at the same compute budget. arxiv.org/abs/2504.17768
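For a concrete picture of what a sparse attention technique looks like: the minimal NumPy sketch below restricts each query to a local sliding window of keys, one of the simplest sparse patterns such comparisons include. The window size, shapes, and mask construction are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    scores = np.where(mask, q @ k.T / np.sqrt(q.shape[-1]), -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n, d, window = 512, 64, 32
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, n, d))

causal = np.tril(np.ones((n, n), dtype=bool))
# Sliding-window sparsity: each query attends only to its last `window` keys.
local = causal & (np.arange(n)[None, :] > np.arange(n)[:, None] - window)

dense, sparse = attention(q, k, v, causal), attention(q, k, v, local)
print("fraction of score entries kept:", local.sum() / causal.sum())
print("max output difference vs dense:", np.abs(dense - sparse).max())
```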
