Alisa Liu (@alisawuffles)'s Twitter Profile
Alisa Liu

@alisawuffles

PhD student at @uwcse @uwnlp

ID: 1197247629136203776

Link: https://alisawuffles.github.io/ · Joined: 20-11-2019 20:17:38

324 Tweets

2.2K Followers

350 Following

Taylor Sorensen (@ma_tay_)'s Twitter Profile Photo

🤔🤖Most AI systems assume there’s just one right answer—but many tasks have reasonable disagreement. How can we better model human variation? 🌍✨

We propose modeling at the individual-level using open-ended, textual value profiles! 🗣️📝

arxiv.org/abs/2503.15484
(1/?)
Abhilasha Ravichander (@lasha_nlp)'s Twitter Profile Photo

Want to know what training data has been memorized by models like GPT-4?

We propose information-guided probes, a method to uncover memorization evidence in *completely black-box* models,

without requiring access to
🙅‍♀️ Model weights
🙅‍♀️ Training data
🙅‍♀️ Token probabilities 🧵1/5
Creston Brooks (@crestonbrooks)'s Twitter Profile Photo

Such a cool paper! Whitespace as a universal token delimiter = pretty arbitrary when there is little consensus on what a "word" even is (esp. when you can save on inference)... there are counterexamples to any combination of criteria posed so far, e.g.: degruyter.com/document/doi/1…

Pratyush Maini (@pratyushmaini)'s Twitter Profile Photo

This is such a cool and intuitive modification to tokenization! And the results look just amazing both in terms of quality and inference speed.

Xenova (@xenovacom)'s Twitter Profile Photo

Love this! 🤗 SuperBPE is a *superword* tokenizer, which can encode multiple words using a single token (up to 33% more efficient than before)! 🤯

Plus, their official playground uses Transformers.js for in-browser tokenization and visualization! 🚀

Give it a try! 👇
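The superword idea can be sketched with a toy greedy longest-match tokenizer (an illustration only, not SuperBPE's actual implementation; the vocabularies here are made up): when the vocabulary is allowed to contain tokens that span whitespace, common multi-word phrases collapse into one token.

```python
# Toy sketch of the *superword* idea: a greedy longest-match tokenizer
# whose vocabulary may contain tokens spanning whitespace.
# Hypothetical vocabularies for illustration only.

def greedy_tokenize(text, vocab):
    """Repeatedly take the longest vocab entry that prefixes the remaining text."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):  # try longest candidate first
            if text[i:j] in vocab:
                match = text[i:j]
                break
        if match is None:                  # fall back to a single character
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

word_vocab = {"by ", "the ", "way", "thanks ", "a ", "lot"}
superword_vocab = word_vocab | {"by the way", "thanks a lot"}

text = "by the way"
print(greedy_tokenize(text, word_vocab))       # ['by ', 'the ', 'way']
print(greedy_tokenize(text, superword_vocab))  # ['by the way']
```

Fewer tokens for the same text is exactly where the encoding-efficiency (and inference-speed) gains come from.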
Etash Guha @ ICLR (@etash_guha)'s Twitter Profile Photo

Turns out, it’s possible to outperform DeepSeekR1-32B with only SFT on open data and no RL: Announcing OpenThinker2-32B and OpenThinker2-7B. We also release the data, OpenThoughts2-1M, curated by selecting quality instructions from diverse sources. 🧵 (1/n)
Gonçalo Faria (@goncalorafaria)'s Twitter Profile Photo

Introducing 𝗤𝗔𝗹𝗶𝗴𝗻🚀, a 𝘁𝗲𝘀𝘁-𝘁𝗶𝗺𝗲 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 𝗺𝗲𝘁𝗵𝗼𝗱 that improves language model performance using Markov chain Monte Carlo. 
With no model retraining, 𝗤𝗔𝗹𝗶𝗴𝗻 outperforms DPO-tuned models even when allowed to match inference compute, and achieves
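The MCMC flavor of test-time alignment can be sketched with a bare Metropolis-Hastings loop (a stand-in only, not QAlign itself: the real method resamples suffixes from the language model, while this sketch just proposes from a fixed candidate pool and uses a made-up reward):

```python
import math
import random

def mock_reward(text):
    # Hypothetical reward model stand-in: prefers longer, polite responses.
    return len(text) + (5.0 if "please" in text else 0.0)

def mh_align(candidates, steps=200, temp=2.0, seed=0):
    """Metropolis-Hastings over candidate responses, targeting high reward."""
    rng = random.Random(seed)
    current = rng.choice(candidates)
    for _ in range(steps):
        proposal = rng.choice(candidates)  # symmetric proposal distribution
        log_accept = (mock_reward(proposal) - mock_reward(current)) / temp
        if math.log(rng.random() + 1e-12) < log_accept:
            current = proposal             # accept the proposed move
    return current

pool = ["no", "sure", "sure, please see the details below"]
print(mh_align(pool))
```

The chain spends its time on high-reward responses without ever touching model weights, which is the core of the no-retraining pitch.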
Jiacheng Liu (@liujc1998)'s Twitter Profile Photo

As infini-gram surpasses 500 million API calls, today we're announcing two exciting updates:
1. Infini-gram is now open-source under Apache 2.0!
2. We indexed the training data of OLMo 2 models. Now you can search the training data of these strong, fully-open LLMs. 🧵 (1/4)
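The query infini-gram answers is "how often does this n-gram occur in the training data?" — served from suffix-array indexes over trillions of tokens. A naive sketch of the same query over a toy corpus (the real index avoids this linear scan entirely):

```python
# Naive n-gram count: scan for the query as a contiguous token subsequence.
# Illustration only; infini-gram uses suffix arrays, not a linear scan.

def ngram_count(corpus_tokens, query_tokens):
    """Count occurrences of query_tokens as a contiguous run in corpus_tokens."""
    n = len(query_tokens)
    return sum(
        corpus_tokens[i:i + n] == query_tokens
        for i in range(len(corpus_tokens) - n + 1)
    )

corpus = "the cat sat on the mat the cat ran".split()
print(ngram_count(corpus, "the cat".split()))  # 2
```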

Jiacheng Liu (@liujc1998)'s Twitter Profile Photo

Today we're unveiling OLMoTrace, a tool that enables everyone to understand the outputs of LLMs by connecting to their training data. We do this on unprecedented scale and in real time: finding matching text between model outputs and 4 trillion training tokens within seconds. ✨

Ai2 (@allen_ai)'s Twitter Profile Photo

"OLMoTrace is a breakthrough in AI development, setting a new standard for transparency and trust. We hope it will empower researchers, developers, and users to build with confidence—on models they can understand and trust." - CEO Ali Farhadi at tonight's fireside chat with
Ian Magnusson (@ianmagnusson)'s Twitter Profile Photo

🔭 Science relies on shared artifacts collected for the common good. 🛰 So we asked: what's missing in open language modeling? 🪐 DataDecide 🌌 charts the cosmos of pretraining—across scales and corpora—at a resolution beyond any public suite of models that has come before.

Ximing Lu (@gximing)'s Twitter Profile Photo

With the rise of R1, search seems out of fashion? We prove the opposite! 😎

Introducing Retro-Search 🌈: an MCTS-inspired search algorithm that RETROspectively revises R1’s reasoning traces to synthesize untaken, new reasoning paths that are better 💡, yet shorter in length ⚡️.
Peter West (@peterwesttm)'s Twitter Profile Photo

I’ve been fascinated lately by the question: what kinds of capabilities might base LLMs lose when they are aligned? i.e. where can alignment make models WORSE? I’ve been looking into this with Christopher Potts and here's one piece of the answer: randomness and creativity
Ai2 (@allen_ai)'s Twitter Profile Photo

📢We’re taking your questions now on Reddit for tomorrow’s AMA! 

Ask us anything about OLMo, our family of fully-open language models. Our researchers will be on hand to answer them Thursday, May 8 at 8am PST.
Wenting Zhao (@wzhao_nlp)'s Twitter Profile Photo

Some personal news: I'll join UMass Amherst CS as an assistant professor in fall 2026. Until then, I'll postdoc at Meta NYC. Reasoning will continue to be my main interest, with a focus on data-centric approaches🤩 If you're also interested, apply to work with me (PhDs & a postdoc)!

Hyunwoo Kim (@hyunw_kim)'s Twitter Profile Photo

📢I'm thrilled to announce that I’ll be joining @KAIST_AI as an Assistant Professor in 2026, leading the Computation & Cognition (COCO) Lab🤖🧠: coco-kaist.github.io
We'll be exploring reasoning, learning w/ synthetic data, and social agents!
+I'm spending a gap year at NVIDIA✨