Sander Land (@magikarp_tokens)'s Twitter Profile
Sander Land

@magikarp_tokens

Staff MLE @ Cohere | Breaking all the models with weird tokens

ID: 1771865972934135808

Link: https://github.com/cohere-ai/magikarp | Joined: 24-03-2024 11:46:04

113 Tweets

928 Followers

78 Following

Sander Land (@magikarp_tokens)'s Twitter Profile Photo

It was great to be able to present my glitch token work at AI Mad Lab in Oslo. As always, great community, excellent conversations, looking forward to many more meetups.

Sander Land (@magikarp_tokens)'s Twitter Profile Photo

Quasar Alpha uses o200k, like GPT-4o. Ask "ម្បី means what" for max glitchiness and occasional errors. Also seems to induce some reasoning-like traces at times.

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025)'s Twitter Profile Photo

📝 Submit papers (up to 9 pages, shorter submissions) via OpenReview: openreview.net/group?id=ICML.…
🗓️ Important dates:
Deadline: May 30, 2025
Notifications: June 9, 2025
Workshop: July 18, 2025
Both archival and non-archival options available! #ICML2025 #TokShop #ML #NLProc

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025)'s Twitter Profile Photo

Did you know BPE (Byte Pair Encoding), the most common LLM tokenizer, was originally a compression algorithm from 1994? #Tokenization #LLM #NLP Want to find out more about tokenization? Join our workshop at ICML! tokenization-workshop.github.io
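
As a quick illustration of that compression idea applied to text, here is a minimal sketch of plain character-level BPE; the function names and toy corpus are my own, not from the workshop:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most frequent adjacent pair of tokens, or None if no pairs exist."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def bpe_train(text, num_merges):
    """Learn up to `num_merges` merge rules by repeatedly merging the most
    frequent adjacent pair, starting from individual characters (plain BPE,
    no pretokenization)."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        merges.append(pair)
        new_token = "".join(pair)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(new_token)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = bpe_train("low lower lowest", num_merges=4)
print(merges)  # learned merge rules, e.g. ('l', 'o') then ('lo', 'w'), ...
print(tokens)  # the text re-segmented with those merges applied
```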

Catherine Arnett (@linguist_cat)'s Twitter Profile Photo

Sander and I have been working on a new encoding scheme for tokenization which mitigates variable length byte sequences for different scripts, prevents partial UTF-8 byte tokens, and offers a simple and efficient pretokenization alternative to regular expressions!
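
To illustrate the "partial UTF-8 byte tokens" issue this tweet refers to, here is a minimal sketch of the general problem (not their proposed scheme): a byte-level tokenizer can place a token boundary inside a multi-byte character, so individual tokens need not decode to valid UTF-8 on their own.

```python
# Illustration only: how byte-level tokenization can produce tokens that are
# not valid UTF-8 by themselves. This is NOT the encoding scheme from the tweet.
text = "ម្បី"                       # Khmer string; each code point is 3 bytes in UTF-8
raw = text.encode("utf-8")
print(len(text), "code points ->", len(raw), "bytes")

# Pretend a byte-level tokenizer put a token boundary mid-character:
token_a, token_b = raw[:4], raw[4:]
print(token_a.decode("utf-8", errors="replace"))  # ends in a replacement char: dangling lead byte
print(token_b.decode("utf-8", errors="replace"))  # starts with replacement chars: orphaned continuation bytes
```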

Saumya Malik (@saumyamalik44)'s Twitter Profile Photo

I’m thrilled to share RewardBench 2 📊— We created a new multi-domain reward model evaluation that is substantially harder than RewardBench, we trained and released 70 reward models, and we gained insights about reward modeling benchmarks and downstream performance!

Saumya Malik (@saumyamalik44)'s Twitter Profile Photo

Thank you to co-authors Nathan Lambert, Valentina Pyatkin, Sander Land, Jacob Morrison, Noah A. Smith, and Hanna Hajishirzi for a great collaboration! Read more in the paper here (ArXiv soon!): github.com/allenai/reward… Dataset, leaderboard, and models here: huggingface.co/collections/al…

Sara Hooker (@sarahookr)'s Twitter Profile Photo

Huge congrats to all the authors Diana Abagyan, Alejandro, Felipe Cruz-Salinas, Kris Cao, John Lin, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün. I always enjoy collabs which tackle learning efficiency as an explicit design choice — rather than post training fixes. arxiv.org/abs/2506.10766

EleutherAI (@aieleuther)'s Twitter Profile Photo

We are launching a new speaker series at EleutherAI, focused on promoting recent research by our team and community members. Our first talk is by Catherine Arnett on tokenizers, their limitations, and how to improve them.
