Valentin Hofmann (@vjhofmann) 's Twitter Profile
Valentin Hofmann

@vjhofmann

Postdoc @allen_ai @uwnlp | Formerly @UniofOxford @CisLMU @stanfordnlp @GoogleDeepMind

ID: 1227633169622556672

Website: https://valentinhofmann.github.io/
Joined: 12-02-2020 16:38:56

234 Tweets

1.1K Followers

243 Following

Sachin Kumar (@shocheen) 's Twitter Profile Photo

Check out this new paper (led by Jake Tae and Hamish Ivison) where we train a generalist instruction-following diffusion LM. It took a lot of effort to make it work. Many cool details to check out 👇. One of my favorite parts was reward model guidance, which allows you to plug and

Valentin Hofmann (@vjhofmann) 's Twitter Profile Photo

Now out in PNAS: Even when LLMs are explicitly unbiased, their outputs still systematically reflect implicit biases against minoritized groups. Make sure to check out this important paper! 🚨

Chan Young Park (@chan_young_park) 's Twitter Profile Photo

⭐️Looking for a PhD Intern⭐️ Join me this summer at MSR to work on personal AI agents! We're developing innovative models to enhance personalized MS Copilot experiences. I'm seeking candidates with strong modeling skills and experience with LLM (multi-)agents/preference learning

Nathan Lambert (@natolambert) 's Twitter Profile Photo

A very exciting day for open-source AI! We're releasing our biggest open source model yet -- OLMo 2 32B -- and it beats the latest GPT 3.5, GPT 4o mini, and leading open weight models like Qwen and Mistral. As usual, all data, weights, code, etc. are available. For a long time,

Valentin Hofmann (@vjhofmann) 's Twitter Profile Photo

Humans store thousands of multi-word expressions like "of course" in their mental lexicon, but current tokenizers don't support multi-word tokens. Enter SuperBPE, a tokenizer that lifts this restriction and brings substantial gains in efficiency and performance! 🚀 Details 👇
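
As an illustration of what "multi-word tokens" means in practice, here is a toy BPE trainer (a minimal sketch, not the actual SuperBPE algorithm or code). The only difference between the two runs is whether merges are allowed to cross whitespace; all names are illustrative.

```python
from collections import Counter

def train_toy_bpe(text: str, num_merges: int, allow_superwords: bool):
    """Toy BPE trainer (illustrative only, not the real SuperBPE implementation).

    Classic BPE pre-splits on whitespace, so no token can span two words.
    The 'superword' variant keeps spaces in the symbol stream, so frequent
    multi-word strings like 'of course' can merge into a single token.
    """
    if allow_superwords:
        sequences = [list(text)]                           # merges may cross spaces
    else:
        sequences = [list(word) for word in text.split()]  # merges stay word-internal

    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                   # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        for seq in sequences:
            i = 0
            while i < len(seq) - 1:
                if (seq[i], seq[i + 1]) == best:
                    seq[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

corpus = "of course it works of course it does of course"
print(train_toy_bpe(corpus, 12, allow_superwords=False))  # no token spans two words
print(train_toy_bpe(corpus, 12, allow_superwords=True))   # 'of course' becomes one token
```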

Jonathan Hayase (@jonathanhayase) 's Twitter Profile Photo

Tokenizers govern the allocation of computation. It's a waste to spend a whole token of compute predicting the "way" in "By the way". SuperBPE redirects that compute to predict more difficult tokens, leading to wins on downstream tasks!
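
To make the compute-allocation point concrete, here is a minimal sketch (assuming the Hugging Face `transformers` library and using GPT-2 purely as a stand-in model, not the models from the paper) that prints per-token loss: highly predictable continuations tend to get much lower loss than content words, yet each one still costs a full forward pass.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "By the way, the meeting moved to Thursday."
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits

# The prediction for token t comes from the logits at position t-1.
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
targets = ids[0, 1:]
nll = -log_probs[torch.arange(targets.numel()), targets]

for token_id, loss in zip(targets.tolist(), nll.tolist()):
    # Predictable tokens (like ' way' after 'By the') tend to show low loss,
    # but the model spends the same amount of compute on them.
    print(f"{tok.decode([token_id])!r:>12}  loss={loss:.2f}")
```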

Benjamin Minixhofer (@bminixhofer) 's Twitter Profile Photo

We created Approximate Likelihood Matching, a principled (and very effective) method for *cross-tokenizer distillation*! With ALM, you can create ensembles of models from different families, convert existing subword-level models to byte-level and a bunch more🧵
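
As a rough intuition for what cross-tokenizer distillation has to reconcile (a simplified sketch, not the ALM objective from the paper): teacher and student segment text differently, so their predictions can only be compared over shared text chunks, for example by pulling the student's chunk likelihood toward the teacher's. All function names below are illustrative.

```python
import torch

def sequence_log_prob(model, tokenizer, text: str) -> torch.Tensor:
    """Total log-probability the model assigns to `text` under its own tokenizer."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    return log_probs[torch.arange(targets.numel()), targets].sum()

def chunk_matching_loss(teacher, teacher_tok, student, student_tok, chunks):
    """For each text chunk, pull the student's likelihood toward the teacher's,
    even though the two models use different tokenizers (illustration only)."""
    loss = 0.0
    for chunk in chunks:
        with torch.no_grad():
            teacher_lp = sequence_log_prob(teacher, teacher_tok, chunk)
        student_lp = sequence_log_prob(student, student_tok, chunk)
        loss = loss + (student_lp - teacher_lp) ** 2
    return loss / len(chunks)
```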

Valentin Hofmann (@vjhofmann) 's Twitter Profile Photo

Delighted there will finally be a workshop devoted to tokenization - a critical topic for LLMs and beyond! 🎉 Join us for the inaugural edition of TokShop at #ICML2025 in Vancouver this summer! 🤗

Neel Bhandari (@neelbhandari9) 's Twitter Profile Photo

1/🚨 𝗡𝗲𝘄 𝗽𝗮𝗽𝗲𝗿 𝗮𝗹𝗲𝗿𝘁 🚨 RAG systems excel on academic benchmarks - but are they robust to variations in linguistic style? We find RAG systems are brittle. Small shifts in phrasing trigger cascading errors, driven by the complexity of the RAG pipeline 🧵
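
One simple way to quantify the brittleness described here is to check whether meaning-preserving rephrasings of a query still produce the same answer end to end. The sketch below is illustrative only; `retrieve` and `generate_answer` are hypothetical stand-ins for a RAG pipeline's components, not functions from the paper's codebase.

```python
def answer_consistency(query: str, paraphrases: list[str], retrieve, generate_answer) -> float:
    """Fraction of stylistic rephrasings that yield the same answer as the original query.

    A brittle pipeline scores low even for meaning-preserving rewrites: a changed
    query shifts the retrieved documents, and the error cascades into the answer.
    """
    baseline = generate_answer(query, retrieve(query))
    same = sum(
        generate_answer(p, retrieve(p)) == baseline
        for p in paraphrases
    )
    return same / len(paraphrases)
```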

Ian Magnusson (@ianmagnusson) 's Twitter Profile Photo

Excited to share that DataDecide, our suite of language models pretrained over differences in data and scale, has been accepted at #ICML2025 💫 See you in Vancouver!

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025) 's Twitter Profile Photo

Got a tokenization paper that just didn't make the cut for ICML? Submit it to the Tokenization Workshop (TokShop) at #ICML2025 -- we'd love to see it there! tokenization-workshop.github.io

Ai2 (@allen_ai) 's Twitter Profile Photo

Do LLMs learn language via rules or analogies? The answer may surprise many: models rely heavily on stored examples and draw analogies when dealing with unfamiliar words, much as humans do. Check out this new study led by Valentin Hofmann to learn how they made the discovery 💡

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025) 's Twitter Profile Photo

Language matters: low-resource languages are severely overtokenized. While English uses ~1.2 tokens per word, languages like Tamil can require more tokens than characters, making #LLMs much costlier for billions of speakers! 💸🌍 Check out our ICML workshop 🔗 tokenization-workshop.github.io
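
A quick way to see this effect yourself is to measure "fertility" (tokens per word) across languages. This is a minimal sketch assuming the Hugging Face `transformers` library; the GPT-2 tokenizer and the example sentences are stand-ins, not the measurements behind the figures quoted above.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # example tokenizer, not a specific LLM's

samples = {
    "English": "The weather is nice today.",
    "Tamil": "இன்று வானிலை நன்றாக உள்ளது.",
}

for lang, text in samples.items():
    n_tokens = len(tok.encode(text))
    n_words = len(text.split())
    n_chars = len(text.replace(" ", ""))
    # Overtokenized languages show many more tokens per word (sometimes even
    # more tokens than characters), which raises API cost and context usage.
    print(f"{lang}: {n_tokens / n_words:.1f} tokens/word, "
          f"{n_tokens / n_chars:.1f} tokens/char")
```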

Clémentine Fourrier 🍊 (@clefourrier) 's Twitter Profile Photo

Textbook example of sleazy eval reporting:
- metric definition hidden in 4-point font
- pass@1 _averaged over 10 trials_ is not pass@1: the model actually can't be compared to competitors in the table
- reports 2 scores: the highest uses test-time compute + removes bad runs + an internal scoring model...
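
As background on the metric under discussion, the standard unbiased pass@k estimator (from Chen et al., 2021, "Evaluating Large Language Models Trained on Code") is computed from n sampled completions of which c pass the tests. The sketch below uses example values chosen purely for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k samples,
    drawn from n generated completions with c correct ones, passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 of which pass.
print(pass_at_k(n=10, c=3, k=1))  # 0.3: equals the per-sample success rate
print(pass_at_k(n=10, c=3, k=5))  # ~0.92: any of 5 tries may pass
```
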
Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025) 's Twitter Profile Photo

Beyond text: Modern AI tokenizes images, too! Vision models split photos into patches, treating each 16x16 pixel square as a "token." 🖼️➡️🔤 #VisualTokenization Interested in tokenization? Join our workshop: tokenization-workshop.github.io. The submission deadline is May 30!
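
For readers unfamiliar with visual tokenization, here is a minimal NumPy sketch of the ViT-style patch split described above (illustrative only; real vision tokenizers also apply a learned projection, positional embeddings, and so on).

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patch x patch squares and
    flatten each into one "token" vector of length patch * patch * C."""
    h, w, c = image.shape
    tokens = (
        image.reshape(h // patch, patch, w // patch, patch, c)
             .transpose(0, 2, 1, 3, 4)           # group pixels by patch grid position
             .reshape(-1, patch * patch * c)
    )
    return tokens

img = np.random.rand(224, 224, 3)
print(patchify(img).shape)  # (196, 768): a 14x14 grid of 16x16x3 patch "tokens"
```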

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025) 's Twitter Profile Photo

Got a good tokenization paper under review at COLM, but the scores were a letdown? 😬 Why bother with rebuttal when the perfect venue is right around the corner! Submit your paper to the #ICML2025 Tokenization Workshop (TokShop) by May 30! 🚀

Oreva Ahia (@orevaahia) 's Twitter Profile Photo

🚨 Reminder: Paper submissions for the 1st Tokenization Workshop (TokShop) at #ICML2025 are due today, May 30! 🔗 CFP: tokenization-workshop.github.io