Sander Land (@magikarp_tokens)'s Twitter Profile
Sander Land

@magikarp_tokens

Staff MLE @ Cohere | Breaking all the models with weird tokens

ID: 1771865972934135808

Link: https://github.com/cohere-ai/magikarp | Joined: 24-03-2024 11:46:04

113 Tweets

928 Followers

78 Following

Sander Land (@magikarp_tokens)'s Twitter Profile Photo

It was great to be able to present my glitch token work at AI Mad Lab in Oslo. As always, great community, excellent conversations, looking forward to many more meetups.

Sander Land (@magikarp_tokens)'s Twitter Profile Photo

Quasar Alpha uses o200k, like GPT-4o. Ask "ម្បី means what" for max glitchiness and occasional errors. Also seems to induce some reasoning-like traces at times.

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025)'s Twitter Profile Photo

📝 Submit papers (up to 9 pages, shorter submissions) via OpenReview: openreview.net/group?id=ICML.…
🗓️ Important dates:
Deadline: May 30, 2025
Notifications: June 9, 2025
Workshop: July 18, 2025
Both archival and non-archival options available! #ICML2025 #TokShop #ML #NLProc

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025)'s Twitter Profile Photo

Did you know BPE (Byte Pair Encoding), the most common LLM tokenizer, was originally a compression algorithm from 1994? #Tokenization #LLM #NLP Want to find out more about tokenization? Join our workshop at ICML! tokenization-workshop.github.io
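
As a quick illustration of that compression idea applied to text, here is a minimal sketch of plain character-level BPE; the function names and toy corpus are my own, not from the workshop:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most frequent adjacent pair of tokens, or None if no pairs exist."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def bpe_train(text, num_merges):
    """Learn up to `num_merges` merge rules by repeatedly merging the most
    frequent adjacent pair, starting from individual characters (plain BPE,
    no pretokenization)."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        merges.append(pair)
        new_token = "".join(pair)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(new_token)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = bpe_train("low lower lowest", num_merges=4)
print(merges)  # learned merge rules, e.g. ('l', 'o') then ('lo', 'w'), ...
print(tokens)  # the text re-segmented with those merges applied
```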

Catherine Arnett (@linguist_cat)'s Twitter Profile Photo

Sander and I have been working on a new encoding scheme for tokenization which mitigates variable length byte sequences for different scripts, prevents partial UTF-8 byte tokens, and offers a simple and efficient pretokenization alternative to regular expressions!
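
To illustrate the "partial UTF-8 byte tokens" issue this tweet refers to, here is a minimal sketch of the general problem (not their proposed scheme): a byte-level tokenizer can place a token boundary inside a multi-byte character, so individual tokens need not decode to valid UTF-8 on their own.

```python
# Illustration only: how byte-level tokenization can produce tokens that are
# not valid UTF-8 by themselves. This is NOT the encoding scheme from the tweet.
text = "ម្បី"                       # Khmer string; each code point is 3 bytes in UTF-8
raw = text.encode("utf-8")
print(len(text), "code points ->", len(raw), "bytes")

# Pretend a byte-level tokenizer put a token boundary mid-character:
token_a, token_b = raw[:4], raw[4:]
print(token_a.decode("utf-8", errors="replace"))  # ends in a replacement char: dangling lead byte
print(token_b.decode("utf-8", errors="replace"))  # starts with replacement chars: orphaned continuation bytes
```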

Saumya Malik (@saumyamalik44)'s Twitter Profile Photo

I’m thrilled to share RewardBench 2 📊— We created a new multi-domain reward model evaluation that is substantially harder than RewardBench, we trained and released 70 reward models, and we gained insights about reward modeling benchmarks and downstream performance!

Saumya Malik (@saumyamalik44)'s Twitter Profile Photo

Thank you to co-authors Nathan Lambert, Valentina Pyatkin, Sander Land, Jacob Morrison, Noah A. Smith, and Hanna Hajishirzi for a great collaboration! Read more in the paper here (ArXiv soon!): github.com/allenai/reward… Dataset, leaderboard, and models here: huggingface.co/collections/al…

Sara Hooker (@sarahookr)'s Twitter Profile Photo

Huge congrats to all the authors Diana Abagyan, Alejandro, Felipe Cruz-Salinas, Kris Cao, John Lin, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün. I always enjoy collabs which tackle learning efficiency as an explicit design choice — rather than post training fixes. arxiv.org/abs/2506.10766

EleutherAI (@aieleuther)'s Twitter Profile Photo

We are launching a new speaker series at EleutherAI, focused on promoting recent research by our team and community members. Our first talk is by Catherine Arnett on tokenizers, their limitations, and how to improve them.
