Valentin Hofmann (@vjhofmann) 's Twitter Profile
Valentin Hofmann

@vjhofmann

Postdoc @allen_ai @uwnlp | Formerly @UniofOxford @CisLMU @stanfordnlp @GoogleDeepMind

ID: 1227633169622556672

Website: https://valentinhofmann.github.io/
Joined: 12-02-2020 16:38:56

234 Tweets

1.1K Followers

243 Following

Sachin Kumar (@shocheen) 's Twitter Profile Photo

Check out this new paper (led by Jake Tae and Hamish Ivison) where we train a generalist instruction-following diffusion LM. It took a lot of effort to make it work. Many cool details to check out 👇. One of my favorite parts was reward model guidance, which allows you to plug and

Valentin Hofmann (@vjhofmann) 's Twitter Profile Photo

Now out in PNAS: Even when LLMs are explicitly unbiased, their outputs still systematically reflect implicit biases against minoritized groups. Make sure to check out this important paper! 🚨

Chan Young Park (@chan_young_park) 's Twitter Profile Photo

⭐️Looking for a PhD Intern⭐️ Join me this summer at MSR to work on personal AI agents! We're developing innovative models to enhance personalized MS Copilot experiences. I'm seeking candidates with strong modeling skills and experience with LLM (multi-)agents/preference learning

Nathan Lambert (@natolambert) 's Twitter Profile Photo

A very exciting day for open-source AI! We're releasing our biggest open source model yet -- OLMo 2 32B -- and it beats the latest GPT 3.5, GPT 4o mini, and leading open weight models like Qwen and Mistral. As usual, all data, weights, code, etc. are available. For a long time,

Valentin Hofmann (@vjhofmann) 's Twitter Profile Photo

Humans store thousands of multi-word expressions like "of course" in their mental lexicon, but current tokenizers don't support multi-word tokens. Enter SuperBPE, a tokenizer that lifts this restriction and brings substantial gains in efficiency and performance! 🚀 Details 👇
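
As an illustration of what "multi-word tokens" means in practice, here is a toy BPE trainer (a minimal sketch, not the actual SuperBPE algorithm or code). The only difference between the two runs is whether merges are allowed to cross whitespace; all names are illustrative.

```python
from collections import Counter

def train_toy_bpe(text: str, num_merges: int, allow_superwords: bool):
    """Toy BPE trainer (illustrative only, not the real SuperBPE implementation).

    Classic BPE pre-splits on whitespace, so no token can span two words.
    The 'superword' variant keeps spaces in the symbol stream, so frequent
    multi-word strings like 'of course' can merge into a single token.
    """
    if allow_superwords:
        sequences = [list(text)]                           # merges may cross spaces
    else:
        sequences = [list(word) for word in text.split()]  # merges stay word-internal

    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                   # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        for seq in sequences:
            i = 0
            while i < len(seq) - 1:
                if (seq[i], seq[i + 1]) == best:
                    seq[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

corpus = "of course it works of course it does of course"
print(train_toy_bpe(corpus, 12, allow_superwords=False))  # no token spans two words
print(train_toy_bpe(corpus, 12, allow_superwords=True))   # 'of course' becomes one token
```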

Jonathan Hayase (@jonathanhayase) 's Twitter Profile Photo

Tokenizers govern the allocation of computation. It's a waste to spend a whole token of compute predicting the "way" in "By the way". SuperBPE redirects that compute to predict more difficult tokens, leading to wins on downstream tasks!
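
To make the compute-allocation point concrete, here is a minimal sketch (assuming the Hugging Face `transformers` library and using GPT-2 purely as a stand-in model, not the models from the paper) that prints per-token loss: highly predictable continuations tend to get much lower loss than content words, yet each one still costs a full forward pass.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "By the way, the meeting moved to Thursday."
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits

# The prediction for token t comes from the logits at position t-1.
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
targets = ids[0, 1:]
nll = -log_probs[torch.arange(targets.numel()), targets]

for token_id, loss in zip(targets.tolist(), nll.tolist()):
    # Predictable tokens (like ' way' after 'By the') tend to show low loss,
    # but the model spends the same amount of compute on them.
    print(f"{tok.decode([token_id])!r:>12}  loss={loss:.2f}")
```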

Benjamin Minixhofer (@bminixhofer) 's Twitter Profile Photo

We created Approximate Likelihood Matching, a principled (and very effective) method for *cross-tokenizer distillation*! With ALM, you can create ensembles of models from different families, convert existing subword-level models to byte-level and a bunch more🧵
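
As a rough intuition for what cross-tokenizer distillation has to reconcile (a simplified sketch, not the ALM objective from the paper): teacher and student segment text differently, so their predictions can only be compared over shared text chunks, for example by pulling the student's chunk likelihood toward the teacher's. All function names below are illustrative.

```python
import torch

def sequence_log_prob(model, tokenizer, text: str) -> torch.Tensor:
    """Total log-probability the model assigns to `text` under its own tokenizer."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    return log_probs[torch.arange(targets.numel()), targets].sum()

def chunk_matching_loss(teacher, teacher_tok, student, student_tok, chunks):
    """For each text chunk, pull the student's likelihood toward the teacher's,
    even though the two models use different tokenizers (illustration only)."""
    loss = 0.0
    for chunk in chunks:
        with torch.no_grad():
            teacher_lp = sequence_log_prob(teacher, teacher_tok, chunk)
        student_lp = sequence_log_prob(student, student_tok, chunk)
        loss = loss + (student_lp - teacher_lp) ** 2
    return loss / len(chunks)
```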

Valentin Hofmann (@vjhofmann) 's Twitter Profile Photo

Delighted there will finally be a workshop devoted to tokenization - a critical topic for LLMs and beyond! 🎉 Join us for the inaugural edition of TokShop at #ICML2025 in Vancouver this summer! 🤗

Neel Bhandari (@neelbhandari9) 's Twitter Profile Photo

1/🚨 𝗡𝗲𝘄 𝗽𝗮𝗽𝗲𝗿 𝗮𝗹𝗲𝗿𝘁 🚨 RAG systems excel on academic benchmarks - but are they robust to variations in linguistic style? We find RAG systems are brittle. Small shifts in phrasing trigger cascading errors, driven by the complexity of the RAG pipeline 🧵
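
One simple way to quantify the brittleness described here is to check whether meaning-preserving rephrasings of a query still produce the same answer end to end. The sketch below is illustrative only; `retrieve` and `generate_answer` are hypothetical stand-ins for a RAG pipeline's components, not functions from the paper's codebase.

```python
def answer_consistency(query: str, paraphrases: list[str], retrieve, generate_answer) -> float:
    """Fraction of stylistic rephrasings that yield the same answer as the original query.

    A brittle pipeline scores low even for meaning-preserving rewrites: a changed
    query shifts the retrieved documents, and the error cascades into the answer.
    """
    baseline = generate_answer(query, retrieve(query))
    same = sum(
        generate_answer(p, retrieve(p)) == baseline
        for p in paraphrases
    )
    return same / len(paraphrases)
```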

Ian Magnusson (@ianmagnusson) 's Twitter Profile Photo

Excited to share that DataDecide, our suite of language models pretrained over differences in data and scale, has been accepted at #ICML2025 💫 See you in Vancouver!

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025) 's Twitter Profile Photo

Got a tokenization paper that just didn't make the cut for ICML? Submit it to the Tokenization Workshop (TokShop) at #ICML2025 -- we'd love to see it there! tokenization-workshop.github.io

Ai2 (@allen_ai) 's Twitter Profile Photo

Do LLMs learn language via rules or analogies? The answer may surprise many: models rely heavily on stored examples and draw analogies when dealing with unfamiliar words, much as humans do. Check out this new study led by Valentin Hofmann to learn how they made the discovery 💡

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025) 's Twitter Profile Photo

Language matters: low-resource languages are severely overtokenized. While English uses ~1.2 tokens per word, languages like Tamil can require more tokens than characters, making #LLMs much costlier for billions of speakers! 💸🌍 Check out our ICML workshop 🔗 tokenization-workshop.github.io
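
A quick way to see this effect yourself is to measure "fertility" (tokens per word) across languages. This is a minimal sketch assuming the Hugging Face `transformers` library; the GPT-2 tokenizer and the example sentences are stand-ins, not the measurements behind the figures quoted above.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # example tokenizer, not a specific LLM's

samples = {
    "English": "The weather is nice today.",
    "Tamil": "இன்று வானிலை நன்றாக உள்ளது.",
}

for lang, text in samples.items():
    n_tokens = len(tok.encode(text))
    n_words = len(text.split())
    n_chars = len(text.replace(" ", ""))
    # Overtokenized languages show many more tokens per word (sometimes even
    # more tokens than characters), which raises API cost and context usage.
    print(f"{lang}: {n_tokens / n_words:.1f} tokens/word, "
          f"{n_tokens / n_chars:.1f} tokens/char")
```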

Clémentine Fourrier 🍊 (@clefourrier) 's Twitter Profile Photo

Textbook example of sleazy eval reporting:
- metric definition hidden in 4-point font
- pass@1 _averaged over 10 trials_ is not pass@1: the model actually can't be compared to competitors in the table
- reports 2 scores: the highest uses test-time compute + removes bad runs + an internal scoring model...
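
As background on the metric under discussion, the standard unbiased pass@k estimator (from Chen et al., 2021, "Evaluating Large Language Models Trained on Code") is computed from n sampled completions of which c pass the tests. The sketch below uses example values chosen purely for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k samples,
    drawn from n generated completions with c correct ones, passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 of which pass.
print(pass_at_k(n=10, c=3, k=1))  # 0.3: equals the per-sample success rate
print(pass_at_k(n=10, c=3, k=5))  # ~0.92: any of 5 tries may pass
```
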
Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025) 's Twitter Profile Photo

Beyond text: Modern AI tokenizes images, too! Vision models split photos into patches, treating each 16x16 pixel square as a "token." 🖼️➡️🔤 #VisualTokenization Interested in tokenization? Join our workshop: tokenization-workshop.github.io. The submission deadline is May 30!
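
For readers unfamiliar with visual tokenization, here is a minimal NumPy sketch of the ViT-style patch split described above (illustrative only; real vision tokenizers also apply a learned projection, positional embeddings, and so on).

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patch x patch squares and
    flatten each into one "token" vector of length patch * patch * C."""
    h, w, c = image.shape
    tokens = (
        image.reshape(h // patch, patch, w // patch, patch, c)
             .transpose(0, 2, 1, 3, 4)           # group pixels by patch grid position
             .reshape(-1, patch * patch * c)
    )
    return tokens

img = np.random.rand(224, 224, 3)
print(patchify(img).shape)  # (196, 768): a 14x14 grid of 16x16x3 patch "tokens"
```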

Tokenization Workshop (TokShop) @ICML2025 (@tokshop2025) 's Twitter Profile Photo

Got a good tokenization paper under review at COLM, but the scores were a letdown? 😬 Why bother with rebuttal when the perfect venue is right around the corner! Submit your paper to the #ICML2025 Tokenization Workshop (TokShop) by May 30! 🚀

Oreva Ahia (@orevaahia) 's Twitter Profile Photo

🚨 Reminder: Paper submissions for the 1st Tokenization Workshop (TokShop) at #ICML2025 are due today, May 30! 🔗 CFP: tokenization-workshop.github.io