Sachin Kumar (@shocheen) 's Twitter Profile
Sachin Kumar

@shocheen

Assistant Professor at @OhioStateCSE. Hiring Ph.D. students (Fall '25).

Previous: @allen_ai, @UWNLP, @LTICMU. He/Him 🏳️‍🌈

ID: 267680298

Link: http://shocheen.com · Joined: 17-03-2011 10:39:58

424 Tweets

1.1K Followers

690 Following

Alisa Liu (@alisawuffles) 's Twitter Profile Photo

We created SuperBPE 🚀, a *superword* tokenizer that includes tokens spanning multiple words. When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time. 🧵

Valentin Hofmann (@vjhofmann) 's Twitter Profile Photo

Humans store thousands of multi-word expressions like "of course" in their mental lexicon, but current tokenizers don't support multi-word tokens. Enter SuperBPE, a tokenizer that lifts this restriction and brings substantial gains in efficiency and performance! 🚀 Details 👇
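To make the mechanism concrete, here is a toy two-stage trainer in the spirit of a superword tokenizer. This is a minimal sketch reconstructed from the tweets, not the authors' code; the exact staging, the ▁ boundary marker, and the merge budgets are assumptions.

```python
# Toy "superword" BPE: stage 1 runs ordinary BPE and never merges across
# word boundaries; stage 2 lifts that restriction, so frequent word
# sequences like "of course" can become single tokens.
from collections import Counter

BOUNDARY = "\u2581"  # "▁": glued to the first character of each new word

def to_symbols(line):
    syms = []
    for j, word in enumerate(line.split()):
        chars = list(word)
        if j > 0:
            chars[0] = BOUNDARY + chars[0]  # remember the preceding space
        syms.extend(chars)
    return tuple(syms)

def best_pair(seqs, allow_cross_word):
    counts = Counter()
    for seq, freq in seqs.items():
        for a, b in zip(seq, seq[1:]):
            # Stage 1 refuses pairs whose right half begins a new word.
            if not allow_cross_word and b.startswith(BOUNDARY):
                continue
            counts[(a, b)] += freq
    return counts.most_common(1)[0][0] if counts else None

def apply_merge(seqs, pair):
    out = Counter()
    for seq, freq in seqs.items():
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                merged.append(seq[i] + seq[i + 1])
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        out[tuple(merged)] += freq
    return out

def train_superbpe(lines, n_word_merges, n_superword_merges):
    seqs = Counter(to_symbols(line) for line in lines)
    merges = []
    for budget, cross in ((n_word_merges, False), (n_superword_merges, True)):
        for _ in range(budget):
            pair = best_pair(seqs, allow_cross_word=cross)
            if pair is None:
                break
            merges.append(pair)
            seqs = apply_merge(seqs, pair)
    return merges
```

For example, train_superbpe(["of course it works", "of course"], 30, 5) first learns ordinary within-word merges; only stage 2 is allowed to fuse "of" and "▁course" into the single superword token "of▁course".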

Abhilasha Ravichander (@lasha_nlp) 's Twitter Profile Photo

Want to know what training data has been memorized by models like GPT-4? We propose information-guided probes, a method to uncover memorization evidence in *completely black-box* models, without requiring access to:
🙅‍♀️ Model weights
🙅‍♀️ Training data
🙅‍♀️ Token probabilities
🧵 1/5

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025) 's Twitter Profile Photo

🚨 NEW WORKSHOP ALERT 🚨 We're thrilled to announce the first-ever Tokenization Workshop (TokShop) at #ICML2025! 🎉 Submissions are open for work on tokenization across all areas of machine learning.
📅 Submission deadline: May 30, 2025
🔗 tokenization-workshop.github.io

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025) 's Twitter Profile Photo

There has been a lot of chatter about tokenization for LLMs over the last few months, but tokenization goes beyond text-based models. It's time we bring the NLP and ML communities together to explore this foundational topic. Let's talk about tokenization at TokShop!

Oreva Ahia (@orevaahia) 's Twitter Profile Photo

Working on tokenization across any modality (text, audio, images, videos)? Submit your paper to our Tokenization Workshop at #ICML2025!

Valentin Hofmann (@vjhofmann) 's Twitter Profile Photo

Delighted there will finally be a workshop devoted to tokenization - a critical topic for LLMs and beyond! 🎉 Join us for the inaugural edition of TokShop at #ICML2025 in Vancouver this summer! 🤗

Tuhin Chakrabarty (@tuhinchakr) 's Twitter Profile Photo

Unlike math/code, writing lacks verifiable rewards. So all we get is slop. To solve this, we train reward models on expert edits that beat SOTA #LLMs by a large margin on a new Writing Quality benchmark. We also reduce #AI slop by using our RMs at test time, boosting alignment with experts.

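Read literally, the recipe has two pieces: learn a reward model from expert edits, then apply it at test time. A minimal sketch under my assumptions: expert edits yield (draft, edited) preference pairs, rm and generate are stand-in callables, and the Bradley-Terry loss is a standard choice here, not confirmed from the paper.

```python
# Expert edits as preference data: the edited text should score above the
# original draft. At test time the RM reranks sampled candidates.
import torch
import torch.nn.functional as F

def pairwise_rm_loss(rm, edited_batch, draft_batch):
    """Bradley-Terry pairwise loss: push r(edited) above r(original draft)."""
    r_good = rm(edited_batch)  # (B,) scalar reward per expert-edited text
    r_bad = rm(draft_batch)    # (B,) scalar reward per original draft
    return -F.logsigmoid(r_good - r_bad).mean()

@torch.no_grad()
def best_of_n(generate, rm, prompt, n=8):
    """Test-time use of the RM: sample n candidates, keep the highest scoring."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = torch.stack([rm(c) for c in candidates])
    return candidates[int(scores.argmax())]
```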
Chan Young Park (@chan_young_park) 's Twitter Profile Photo

🚀 Excited to share our #NAACL2025 paper on Language Model Personalization! arxiv.org/abs/2410.16027 Current RLHF methods often overlook *whose* preferences are being optimized. This can cause conflicting signals and models that mainly cater to the "average" or most dominant users.

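One generic way to stop a reward model from collapsing to the "average" user is to condition it on who is judging. The sketch below illustrates that idea only; it is not the paper's method, and every name in it is hypothetical.

```python
# User-conditioned reward model: the score depends on *which user* is
# judging, so conflicting preferences need not average out.
import torch
import torch.nn as nn

class UserConditionedRM(nn.Module):
    def __init__(self, text_dim, n_users, user_dim=16):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, user_dim)
        self.head = nn.Linear(text_dim + user_dim, 1)

    def forward(self, text_features, user_ids):
        # A single unconditioned head would fit the majority preference;
        # concatenating a user embedding gives each group its own reward.
        u = self.user_emb(user_ids)
        return self.head(torch.cat([text_features, u], dim=-1)).squeeze(-1)
```

With one shared head, users who disagree contribute opposing gradients and the model settles on the dominant group's preference; the user embedding lets conflicting signals coexist.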
Chan Young Park (@chan_young_park) 's Twitter Profile Photo

While I'm on X to share my paper, I also have a life update: I'll be joining the School of Information at UT Austin as an assistant professor starting Fall 2026! Excited for this next chapter, and to keep working on teaching computers to better understand language and humans (+ now teaching humans too).

Sachin Kumar (@shocheen) 's Twitter Profile Photo

I will be at #NAACL2025 next week to talk about this paper. Much work on personalizing LLMs focuses on explicit preferences, norms, and values, either directly optimized for or specified in the prompts/instructions. In this work, we study implicit preferences that may not…

Sanchaita Hazra (@hsanchaita) 's Twitter Profile Photo

Very excited for a new #ICML2025 position paper accepted as an oral w/ Bodhisattwa Majumder & Tuhin Chakrabarty! 😎 What are the longitudinal harms of AI development? We use economic theories to highlight AI's intertemporal impacts on livelihoods & its role in deepening labor-market inequality.

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025) 's Twitter Profile Photo

📣 Call for Papers Alert: TokShop @ ICML 2025. TokShop explores tokenization across all data modalities. Topics include: subword NLP techniques, multimodal approaches, multilingual challenges, post-training modification, alternative representations, and statistical perspectives.

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025) 's Twitter Profile Photo

Got a good tokenization paper under review at COLM, but the scores were a letdown? 😬 Why bother with rebuttal when the perfect venue is right around the corner! Submit your paper to the #ICML2025 Tokenization Workshop (TokShop) by May 30! 🚀