Sachin Kumar (@shocheen) 's Twitter Profile
Sachin Kumar

@shocheen

Assistant Professor at @OhioStateCSE. Hiring Ph.D. students (Fall '25).

Previous: @allen_ai, @UWNLP, @LTICMU. He/Him πŸ³οΈβ€πŸŒˆ

ID: 267680298

Link: http://shocheen.com | Joined: 17-03-2011 10:39:58

424 Tweets

1.1K Followers

690 Following

Alisa Liu (@alisawuffles) 's Twitter Profile Photo

We created SuperBPEπŸš€, a *superword* tokenizer that includes tokens spanning multiple words.

When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧡
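This is not the released SuperBPE implementation, just a minimal from-scratch sketch of the core idea in the thread: if BPE-style merges are allowed to treat whitespace as an ordinary symbol, frequent multi-word expressions can end up as single "superword" tokens. The toy corpus, function names, and single-stage training loop are illustrative simplifications, not the paper's actual training recipe.

```python
# Minimal sketch of "superword" BPE: merges may cross whitespace.
from collections import Counter

def pair_counts(seqs):
    """Count adjacent symbol pairs across all sequences."""
    counts = Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts

def apply_merge(seq, pair, new_sym):
    """Replace every non-overlapping occurrence of `pair` with `new_sym`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_sym)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train(corpus, num_merges, allow_superwords):
    # Start from raw characters; a space is just another symbol here.
    seqs = [list(line) for line in corpus]
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(seqs)
        if not allow_superwords:
            # Plain-BPE regime: forbid any merge that touches a space.
            counts = Counter({p: c for p, c in counts.items()
                              if " " not in p[0] + p[1]})
        if not counts:
            break
        best = max(counts, key=counts.get)
        new_sym = best[0] + best[1]
        seqs = [apply_merge(s, best, new_sym) for s in seqs]
        merges.append(new_sym)
    return merges, seqs

corpus = ["of course it works", "of course it helps", "because of course"]
merges, seqs = train(corpus, num_merges=25, allow_superwords=True)
# With superword merges enabled, learned tokens may contain spaces, i.e. a
# single token can cover a multi-word expression like "of course".
print([m for m in merges if " " in m])
print(seqs[0])
```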
Valentin Hofmann (@vjhofmann) 's Twitter Profile Photo

Humans store thousands of multi-word expressions like "of course" in their mental lexicon, but current tokenizers don't support multi-word tokens. Enter SuperBPE, a tokenizer that lifts this restriction and brings substantial gains in efficiency and performance! πŸš€ Details πŸ‘‡

Abhilasha Ravichander (@lasha_nlp) 's Twitter Profile Photo

Want to know what training data has been memorized by models like GPT-4?

We propose information-guided probes, a method to uncover memorization evidence in *completely black-box* models,

without requiring access to
πŸ™…β€β™€οΈ Model weights
πŸ™…β€β™€οΈ Training data
πŸ™…β€β™€οΈ Token probabilities 🧡1/5
Patrick Da Silva (@patrickqdasilva) 's Twitter Profile Photo

We report many aggregated results in our paper, and invite researchers to comb through the extensive results in our repository to build intuitions about model variance.
Our paper: arxiv.org/abs/2504.04635
Code, Data, Results, and Figures for all LMs: github.com/patqdasilva/st… (9/10)

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025) 's Twitter Profile Photo

🚨 NEW WORKSHOP ALERT 🚨 We're thrilled to announce the first-ever Tokenization Workshop (TokShop) at #ICML2025! πŸŽ‰ Submissions are open for work on tokenization across all areas of machine learning.
πŸ“… Submission deadline: May 30, 2025
πŸ”— tokenization-workshop.github.io

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025) 's Twitter Profile Photo

There has been a lot of chatter about tokenization for LLMs over the last few months, but tokenization goes beyond text-based models. It's time we bring the NLP and ML communities together to explore this foundational topic. Let's talk about tokenization at TokShop!

Oreva Ahia (@orevaahia) 's Twitter Profile Photo

Working on tokenization for any modality (text, audio, images, or videos)? Submit your paper to our Tokenization Workshop at #ICML2025!

Valentin Hofmann (@vjhofmann) 's Twitter Profile Photo

Delighted there will finally be a workshop devoted to tokenization - a critical topic for LLMs and beyond! πŸŽ‰ Join us for the inaugural edition of TokShop at #ICML2025 in Vancouver this summer! πŸ€—

Tuhin Chakrabarty (@tuhinchakr) 's Twitter Profile Photo

Unlike math/code, writing lacks verifiable rewards, so all we get is slop. To solve this, we train reward models on expert edits that largely beat SOTA #LLMs on a new Writing Quality benchmark. We also reduce #AI slop by using our RMs at test time, boosting alignment with experts.

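The tweet does not spell out how the reward models are applied at inference. One common way to use an RM at test time, shown in the sketch below as an assumption rather than the paper's exact procedure, is best-of-n reranking: sample several drafts and keep the one the reward model scores highest. `sample_drafts` and `reward_score` are placeholders for the reader's own sampler and trained RM.

```python
# Sketch of using a trained reward model at test time via best-of-n
# reranking (one common recipe; the paper's procedure may differ).
from typing import Callable, List

def best_of_n(prompt: str,
              sample_drafts: Callable[[str, int], List[str]],
              reward_score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n drafts for `prompt` and return the draft the RM prefers."""
    drafts = sample_drafts(prompt, n)
    return max(drafts, key=lambda d: reward_score(prompt, d))

# Usage sketch (both callables are stand-ins, not a specific library API):
# best = best_of_n("Rewrite this intro without clichés: ...", my_sampler, my_rm)
```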
Chan Young Park (@chan_young_park) 's Twitter Profile Photo

πŸš€ Excited to share our #NAACL2025 paper on Language Model Personalization! arxiv.org/abs/2410.16027
Current RLHF methods often overlook *whose* preferences are being optimized. This can cause conflicting signals and models that mainly cater to the β€œaverage” or most dominant users
Chan Young Park (@chan_young_park) 's Twitter Profile Photo

While I'm on X to share my paper, I also have a life update: I'll be joining the School of Information at UT Austin as an assistant professor starting Fall 2026! Excited for this next chapter, and to keep working on teaching computers to better understand language and humans (+ now teaching humans too).

Sachin Kumar (@shocheen) 's Twitter Profile Photo

I will be at #NAACL2025 next week to talk about this paper. Much work on personalizing LLMs focuses on explicit preferences, norms, and values, either directly optimized for or specified in the prompts/instructions. In this work, we study implicit preferences that may not…

Sanchaita Hazra (@hsanchaita) 's Twitter Profile Photo

Very excited for a new #ICML2025 position paper accepted as oral w Bodhisattwa Majumder & Tuhin Chakrabarty! 😎 What are the longitudinal harms of AI development? We use economic theories to highlight AI’s intertemporal impacts on livelihoods & its role in deepening labor-market inequality.

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025) 's Twitter Profile Photo

πŸ“£ Call for Papers Alert: TokShop @ ICML 2025
TokShop explores tokenization across all data modalities. Topics include: subword NLP techniques, multimodal approaches, multilingual challenges, post-training modification, alternative representations, and statistical perspectives.

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025) 's Twitter Profile Photo

Got a good tokenization paper under review at COLM, but the scores were a letdown? 😬 Why bother with rebuttal when the perfect venue is right around the corner! Submit your paper to the #ICML2025 Tokenization Workshop (TokShop) by May 30! πŸš€