Csordás Róbert (@robert_csordas)'s Twitter Profile
Csordás Róbert

@robert_csordas

Postdoc at Stanford working on systematic generalization and algorithmic reasoning. Ex-IDSIA PhD, ex-@DeepMind intern.

ID: 745005274784751616

Website: https://robertcsordas.github.io/ · Joined: 20-06-2016 21:27:54

169 Tweets

762 Followers

426 Following

The TWIML AI Podcast (@twimlai)'s Twitter Profile Photo

Today, we’re joined by Julie Kallini ✨, a PhD student in the Stanford NLP Group, to discuss her recent papers, “MrT5: Dynamic Token Merging for Efficient Byte-level Language Models” and “Mission: Impossible Language Models.” For the MrT5 paper, we explore the importance and failings of

Julie Kallini ✨ @ ICLR 2025 ✈️ (@juliekallini)'s Twitter Profile Photo

🚀 In T-minus 1 week, I’ll be at ICLR presenting MrT5!

The final version has tons of updates:
- New controller algorithm for targeted compression rates
- More baselines and downstream tasks
- Scaled-up experiments to 1.23B parameter models

And now, MrT5 is on 🤗HuggingFace! 🧵
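
As a rough illustration of the dynamic token deletion/merging idea behind MrT5 (not the paper's actual gate, training objective, or controller — the module name, keep-ratio knob, and hard top-k below are assumptions for the sketch), a learned gate can score each byte-level position after an early encoder layer and drop the low-scoring ones so that later layers process a shorter sequence:

# Rough sketch of dynamic token deletion in a byte-level encoder
# (illustrative only; not MrT5's actual mechanism).
import torch
import torch.nn as nn

class TokenDeletionGate(nn.Module):
    def __init__(self, d_model: int, keep_ratio: float = 0.5):
        super().__init__()
        self.score = nn.Linear(d_model, 1)  # per-position "keep" logit
        self.keep_ratio = keep_ratio        # assumed knob standing in for a target compression rate

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, seq_len, d_model) states after an early encoder layer
        logits = self.score(hidden).squeeze(-1)                  # (batch, seq_len)
        k = max(1, int(self.keep_ratio * hidden.size(1)))
        keep_idx = logits.topk(k, dim=-1).indices.sort(-1).values
        batch_idx = torch.arange(hidden.size(0)).unsqueeze(-1)
        return hidden[batch_idx, keep_idx], keep_idx             # shorter sequence for later layers

x = torch.randn(1, 16, 8)                             # 16 byte-level positions, model dim 8
gate = TokenDeletionGate(d_model=8, keep_ratio=0.25)
shortened, kept = gate(x)
print(shortened.shape, kept)                          # torch.Size([1, 4, 8]) plus the kept positions

Training such a gate end to end typically needs a soft or stochastic relaxation of the hard top-k; the hard selection above only illustrates the effect on sequence length.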

Jürgen Schmidhuber (@schmidhuberai)'s Twitter Profile Photo

My first work on metalearning or learning to learn came out in 1987 [1][2]. Back then nobody was interested. Today, compute is 10 million times cheaper, and metalearning is a hot topic 🙂 It’s fitting that my 100th journal publication [100] is about metalearning, too.

Shikhar (@shikharmurty)'s Twitter Profile Photo

New #NAACL2025 paper! 🚨 Transformer LMs are data-hungry; we propose a new auxiliary loss function (TreeReg) to fix that. TreeReg takes bracketing decisions from syntax trees and turns them into orthogonality constraints on span representations. ✅ Boosts pre-training data
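
As a toy sketch of what "orthogonality constraints on span representations" could look like (the mean-pooling, span pairing, and squared-cosine penalty here are assumptions for illustration, not TreeReg's actual formulation), an auxiliary loss can push the pooled representations of distinct bracketed spans toward orthogonality:

# Toy orthogonality penalty over bracketed spans, e.g. "((the cat) (sat down))".
# Illustrative only; not TreeReg's exact recipe.
import torch
import torch.nn.functional as F

def span_repr(hidden, start, end):
    # Mean-pool the token states of one bracketed span into a single vector.
    return hidden[start:end].mean(dim=0)

def orthogonality_penalty(hidden, spans):
    # spans: list of (start, end) pairs taken from a syntax tree's bracketing.
    reps = torch.stack([F.normalize(span_repr(hidden, s, e), dim=0) for s, e in spans])
    gram = reps @ reps.T                            # pairwise cosine similarities
    off_diag = gram - torch.diag(torch.diag(gram))  # zero out the diagonal
    return (off_diag ** 2).mean()                   # auxiliary term added to the LM loss

hidden = torch.randn(6, 16)                         # 6 tokens, hidden size 16
loss_aux = orthogonality_penalty(hidden, [(0, 2), (2, 4), (4, 6)])
print(loss_aux)                                     # scalar, weighted and added to the main loss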

Julie Kallini ✨ @ ICLR 2025 ✈️ (@juliekallini)'s Twitter Profile Photo

If you're at #ICLR2025 this week, come check out my poster for 💪MrT5 on Thursday (4/24) from 10am to 12:30pm! The poster is at Hall 3 + Hall 2B #273. I'll also be giving a ⚡ lightning talk right after at the session on tokenizer-free, end-to-end architectures in Opal 103-104!

Piotr Piękos (@piotrpiekosai)'s Twitter Profile Photo

What if instead of a couple of dense attention heads, we use lots of sparse heads, each learning to select its own set of tokens to process?

Introducing Mixture of Sparse Attention (MoSA)
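
A minimal sketch of one way such a head could work — a small router scores the tokens, the head keeps only its top-k, and standard attention runs on that subset (the routing, scoring, and output scattering below are assumptions, not necessarily MoSA's exact design):

# One sparse attention head that selects its own top-k tokens (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseHead(nn.Module):
    def __init__(self, d_model: int, d_head: int, k: int):
        super().__init__()
        self.router = nn.Linear(d_model, 1)   # learns which tokens this head attends over
        self.q = nn.Linear(d_model, d_head)
        self.kv = nn.Linear(d_model, 2 * d_head)
        self.out = nn.Linear(d_head, d_model)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, d_model); this head picks its own k tokens to process.
        scores = self.router(x).squeeze(-1)                  # (seq_len,)
        idx = scores.topk(self.k).indices.sort().values      # this head's token subset
        sub = x[idx]                                         # (k, d_model)
        q = self.q(sub)
        key, val = self.kv(sub).chunk(2, dim=-1)
        attn = F.softmax(q @ key.T / key.size(-1) ** 0.5, dim=-1)
        out = torch.zeros_like(x)
        out[idx] = self.out(attn @ val)                      # scatter back to full length
        return out

x = torch.randn(32, 64)                 # 32 tokens, model dim 64
head = SparseHead(d_model=64, d_head=16, k=8)
print(head(x).shape)                    # torch.Size([32, 64]); only 8 positions are nonzero

Because each head only attends over its k selected tokens, many such heads can fit in roughly the budget of a few dense ones.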

fly51fly (@fly51fly)'s Twitter Profile Photo

[LG] Do Language Models Use Their Depth Efficiently?
R Csordás, C D. Manning, C Potts [Stanford University] (2025)
arxiv.org/abs/2505.13898

William Merrill (@lambdaviking)'s Twitter Profile Photo

Padding a transformer’s input with blank tokens (...) is a simple form of test-time compute. Can it increase the computational power of LLMs? 👀

New work with Ashish Sabharwal addresses this with *exact characterizations* of the expressive power of transformers with padding 🧵
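
The padding setup itself is simple; here is a tiny sketch with made-up token ids and a hypothetical dedicated blank token (the paper's contribution is the theoretical characterization, not this recipe):

# Padding as test-time compute: append "blank" filler tokens so the model gets
# extra forward-pass computation before it has to answer. Ids are illustrative.
prompt_ids = [17, 42, 8, 99]   # hypothetical ids for a short question
BLANK_ID = 0                   # id of a dedicated blank/filler token (assumed)
num_blanks = 16                # how much extra computation to grant

padded_ids = prompt_ids + [BLANK_ID] * num_blanks
# The model then reads off its answer at the position after the padding, e.g.:
#   logits = model(padded_ids); answer = logits[-1].argmax()
print(padded_ids)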

Aryaman Arora (@aryaman2020)'s Twitter Profile Photo

new paper! 🫡

why are state space models (SSMs) worse than Transformers at recall over their context? this is a question about the mechanisms underlying model behaviour; therefore, we propose using mechanistic evaluations to answer it!
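
For context, "recall over their context" is typically probed with associative-recall-style prompts; a toy generator for such probes is sketched below (illustrative only, not the paper's actual mechanistic evaluation):

# Toy associative-recall probe: the model sees key-value pairs, then a query key,
# and must emit the paired value.
import random

def recall_example(n_pairs=8):
    keys = random.sample(range(100, 200), n_pairs)   # distinct key tokens
    vals = random.sample(range(200, 300), n_pairs)   # distinct value tokens
    context = [tok for k, v in zip(keys, vals) for tok in (k, v)]
    query = random.choice(keys)
    target = vals[keys.index(query)]                 # the value the model must recall
    return context + [query], target

prompt, answer = recall_example()
print(prompt, "->", answer)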

Mehrdad Farajtabar (@mfarajtabar)'s Twitter Profile Photo

🧵 1/8 The Illusion of Thinking: Are reasoning models like o1/o3, DeepSeek-R1, and Claude 3.7 Sonnet really "thinking"? 🤔 Or are they just throwing more compute towards pattern matching?

The new Large Reasoning Models (LRMs) show promising gains on math and coding benchmarks,

David Chiang (@davidweichiang)'s Twitter Profile Photo

New on arXiv: Knee-Deep in C-RASP, by Andy J Yang, Michael Cadilhac and me. The solid stepped line is our theoretical prediction based on what problems C-RASP can solve, and the numbers/colors are what transformers (no position embedding) can learn.

Nouha Dziri (@nouhadziri)'s Twitter Profile Photo

📢 Can LLMs really reason outside the box in math? Or are they just remixing familiar strategies? 

Remember how DeepSeek-R1 and o1 impressed us on Olympiad-level math, yet still failed at simple arithmetic 😬

We built a benchmark to find out → OMEGA Ω 📐

💥 We found