Sachin Kumar (@shocheen) 's Twitter Profile
Sachin Kumar

@shocheen

Assistant Professor at @OhioStateCSE. Hiring Ph.D. students (Fall '25).

Previous: @allen_ai, @UWNLP, @LTICMU. He/Him πŸ³οΈβ€πŸŒˆ

ID: 267680298

Link: http://shocheen.com | Joined: 17-03-2011 10:39:58

424 Tweets

1.1K Followers

690 Following

Alisa Liu (@alisawuffles) 's Twitter Profile Photo

We created SuperBPEπŸš€, a *superword* tokenizer that includes tokens spanning multiple words.

When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧡
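This is not the released SuperBPE implementation, just a minimal from-scratch sketch of the core idea in the thread: if BPE-style merges are allowed to treat whitespace as an ordinary symbol, frequent multi-word expressions can end up as single "superword" tokens. The toy corpus, function names, and single-stage training loop are illustrative simplifications, not the paper's actual training recipe.

```python
# Minimal sketch of "superword" BPE: merges may cross whitespace.
from collections import Counter

def pair_counts(seqs):
    """Count adjacent symbol pairs across all sequences."""
    counts = Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts

def apply_merge(seq, pair, new_sym):
    """Replace every non-overlapping occurrence of `pair` with `new_sym`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_sym)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train(corpus, num_merges, allow_superwords):
    # Start from raw characters; a space is just another symbol here.
    seqs = [list(line) for line in corpus]
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(seqs)
        if not allow_superwords:
            # Plain-BPE regime: forbid any merge that touches a space.
            counts = Counter({p: c for p, c in counts.items()
                              if " " not in p[0] + p[1]})
        if not counts:
            break
        best = max(counts, key=counts.get)
        new_sym = best[0] + best[1]
        seqs = [apply_merge(s, best, new_sym) for s in seqs]
        merges.append(new_sym)
    return merges, seqs

corpus = ["of course it works", "of course it helps", "because of course"]
merges, seqs = train(corpus, num_merges=25, allow_superwords=True)
# With superword merges enabled, learned tokens may contain spaces, i.e. a
# single token can cover a multi-word expression like "of course".
print([m for m in merges if " " in m])
print(seqs[0])
```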
Valentin Hofmann (@vjhofmann) 's Twitter Profile Photo

Humans store thousands of multi-word expressions like "of course" in their mental lexicon, but current tokenizers don't support multi-word tokens. Enter SuperBPE, a tokenizer that lifts this restriction and brings substantial gains in efficiency and performance! πŸš€ Details πŸ‘‡

Abhilasha Ravichander (@lasha_nlp) 's Twitter Profile Photo

Want to know what training data has been memorized by models like GPT-4?

We propose information-guided probes, a method to uncover memorization evidence in *completely black-box* models,

without requiring access to
πŸ™…β€β™€οΈ Model weights
πŸ™…β€β™€οΈ Training data
πŸ™…β€β™€οΈ Token probabilities 🧡1/5
Patrick Da Silva (@patrickqdasilva) 's Twitter Profile Photo

We report many aggregated results in our paper, and invite researchers to comb through the extensive results in our repository to build intuitions about model variance.
Our paper: arxiv.org/abs/2504.04635
Code, Data, Results, and Figures for all LMs: github.com/patqdasilva/st… (9/10)

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025) 's Twitter Profile Photo

🚨 NEW WORKSHOP ALERT 🚨 We're thrilled to announce the first-ever Tokenization Workshop (TokShop) at #ICML2025! πŸŽ‰ Submissions are open for work on tokenization across all areas of machine learning.
πŸ“… Submission deadline: May 30, 2025
πŸ”— tokenization-workshop.github.io

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025) 's Twitter Profile Photo

There has been a lot of chatter about tokenization for LLMs over the last few months, but tokenization goes beyond text-based models. It's time we bring the NLP and ML communities together to explore this foundational topic. Let's talk about tokenization at TokShop!

Oreva Ahia (@orevaahia) 's Twitter Profile Photo

Working on tokenization for any modality (text, audio, images, or videos)? Submit your paper to our Tokenization Workshop at #ICML2025!

Valentin Hofmann (@vjhofmann) 's Twitter Profile Photo

Delighted there will finally be a workshop devoted to tokenization - a critical topic for LLMs and beyond! πŸŽ‰ Join us for the inaugural edition of TokShop at #ICML2025 in Vancouver this summer! πŸ€—

Tuhin Chakrabarty (@tuhinchakr) 's Twitter Profile Photo

Unlike math/code, writing lacks verifiable rewards, so all we get is slop. To solve this, we train reward models on expert edits that largely beat SOTA #LLMs on a new Writing Quality benchmark. We also reduce #AI slop by using our RMs at test time, boosting alignment with experts.

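The tweet does not spell out how the reward models are applied at inference. One common way to use an RM at test time, shown in the sketch below as an assumption rather than the paper's exact procedure, is best-of-n reranking: sample several drafts and keep the one the reward model scores highest. `sample_drafts` and `reward_score` are placeholders for the reader's own sampler and trained RM.

```python
# Sketch of using a trained reward model at test time via best-of-n
# reranking (one common recipe; the paper's procedure may differ).
from typing import Callable, List

def best_of_n(prompt: str,
              sample_drafts: Callable[[str, int], List[str]],
              reward_score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n drafts for `prompt` and return the draft the RM prefers."""
    drafts = sample_drafts(prompt, n)
    return max(drafts, key=lambda d: reward_score(prompt, d))

# Usage sketch (both callables are stand-ins, not a specific library API):
# best = best_of_n("Rewrite this intro without clichés: ...", my_sampler, my_rm)
```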
Chan Young Park (@chan_young_park) 's Twitter Profile Photo

πŸš€ Excited to share our #NAACL2025 paper on Language Model Personalization! arxiv.org/abs/2410.16027
Current RLHF methods often overlook *whose* preferences are being optimized. This can cause conflicting signals and models that mainly cater to the β€œaverage” or most dominant users
Chan Young Park (@chan_young_park) 's Twitter Profile Photo

While I'm on X to share my paper, I also have a life update: I'll be joining the School of Information at UT Austin as an assistant professor starting Fall 2026! Excited for this next chapter, and to keep working on teaching computers to better understand language and humans (+ now teaching humans too).

Sachin Kumar (@shocheen) 's Twitter Profile Photo

I will be at #NAACL2025 next week to talk about this paper. Much work on personalizing LLMs focuses on explicit preferences, norms, and values, either directly optimized for or specified in the prompts/instructions. In this work, we study implicit preferences that may not…

Sanchaita Hazra (@hsanchaita) 's Twitter Profile Photo

Very excited for a new #ICML2025 position paper accepted as oral w Bodhisattwa Majumder & Tuhin Chakrabarty! 😎 What are the longitudinal harms of AI development? We use economic theories to highlight AI’s intertemporal impacts on livelihoods & its role in deepening labor-market inequality.

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025) 's Twitter Profile Photo

πŸ“£ Call for Papers Alert: TokShop @ ICML 2025
TokShop explores tokenization across all data modalities. Topics include: subword NLP techniques, multimodal approaches, multilingual challenges, post-training modification, alternative representations, and statistical perspectives.

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025) 's Twitter Profile Photo

Got a good tokenization paper under review at COLM, but the scores were a letdown? 😬 Why bother with rebuttal when the perfect venue is right around the corner! Submit your paper to the #ICML2025 Tokenization Workshop (TokShop) by May 30! πŸš€