Hamish Ivison (@hamishivi) Twitter Tweets • TwiCopy

Alisa Liu

6 months ago

We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words. When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵

thumb_up_off_alt2,2K

chat_bubble_outline96

repeat322

shareShare

Hamish Ivison

@hamishivi

5 months ago

generation creds to Nathan Lambert 🤩

generation creds to <a href="/natolambert/">Nathan Lambert</a> 🤩

thumb_up_off_alt15

chat_bubble_outline1

repeat0

shareShare

Hamish Ivison

@hamishivi

5 months ago

Excited to be back home in Australia (Syd/Melb) for most of April! Email or DM if you want to grab a coffee :)

thumb_up_off_alt26

chat_bubble_outline0

repeat0

shareShare

Ai2

@allen_ai

5 months ago

For years it’s been an open question — how much is a language model learning and synthesizing information, and how much is it just memorizing and reciting? Introducing OLMoTrace, a new feature in the Ai2 Playground that begins to shed some light. 🔦

thumb_up_off_alt638

chat_bubble_outline17

repeat167

shareShare

Kabir

@kabirahuja004

5 months ago

📢 New Paper! Tired 😴 of reasoning benchmarks full of math & code? In our work we consider the problem of reasoning for plot holes in stories -- inconsistencies in a storyline that break the internal logic or rules of a story’s world 🌎 W/ Melanie Sclar, and tsvetshop 1/n

thumb_up_off_alt241

chat_bubble_outline3

repeat46

shareShare

Hamish Ivison

@hamishivi

5 months ago

My favourite recently is still sometimes when training a model to do multiplication it just.... repeats the question in a sympy-friendly way, since sympy performs the multiplication for it when parsing the answer. Emergent tool-use? 😅

thumb_up_off_alt17

chat_bubble_outline1

repeat0

shareShare

Hamish Ivison

@hamishivi

4 months ago

v cool stuff 😁

thumb_up_off_alt6

chat_bubble_outline0

repeat0

shareShare

Hamish Ivison

@hamishivi

4 months ago

I’ll be around for this! Come ask us questions about olmo and tulu :)

thumb_up_off_alt15

chat_bubble_outline0

repeat9

shareShare

Ai2

@allen_ai

4 months ago

We’re live on Reddit! Ask us Anything about our OLMo family of models. We have six of our researchers on hand to answer all your questions.

thumb_up_off_alt85

chat_bubble_outline3

repeat29

shareShare

Hamish Ivison

@hamishivi

4 months ago

'all u need' rn: good base + rl loop + online tools + grader model + clever filtering

thumb_up_off_alt13

chat_bubble_outline1

repeat0

shareShare

Rui Xin

@rui_xin31

4 months ago

Think PII scrubbing ensures privacy? 🤔Think again‼️ In our paper, for the first time on unstructured text, we show that you can re-identify over 70% of private information *after* scrubbing! It’s time to move beyond surface-level anonymization. #Privacy #NLProc 🔗🧵

thumb_up_off_alt50

chat_bubble_outline2

repeat19

shareShare

Tong Chen @ ICLR

@tomchen0

4 months ago

LLMs naturally memorize some verbatim of pre-training data. We study whether post-training can be an effective way to mitigate unintentional reproduction of pre-training data. 🛠️ No changes to pre-training or decoding 🔥 Training models to latently distinguish between memorized

thumb_up_off_alt98

chat_bubble_outline1

repeat30

shareShare

Hamish Ivison

@hamishivi

4 months ago

Learnt a lot from working with Costa! Truly a GOAT of OSS RL infra 🫡 sad to see him go but excited for what's next 😁

thumb_up_off_alt19

chat_bubble_outline1

repeat0

shareShare

Hamish Ivison

@hamishivi

4 months ago

🤔

thumb_up_off_alt9

chat_bubble_outline1

repeat0

shareShare

Hamish Ivison

@hamishivi

3 months ago

Was fun to see this come together! Personally, i think it highlights the fact that we should evaluate rlvr findings across multiple model families (maybe including olmo wink wink nudge nudge?)

thumb_up_off_alt21

chat_bubble_outline0

repeat4

shareShare

Shan Chen

@shan23chen

3 months ago

‼️ 1/n Ask your reasoning model to think in lower resource language does degrade models’ performance at the moment. My awesome Co-author already communicated the main points in the thread, I will just communicate some random things we learned in my 🧵

thumb_up_off_alt16

chat_bubble_outline1

repeat8

shareShare

Ai2

@allen_ai

3 months ago

RewardBench 2 is here! We took a long time to learn from our first reward model evaluation tool to make one that is substantially harder and more correlated with both downstream RLHF and inference-time scaling.

thumb_up_off_alt136

chat_bubble_outline2

repeat21

shareShare

Jacqueline He

@jcqln_h

3 months ago

LMs often output answers that sound right but aren’t supported by input context. This is intrinsic hallucination: the generation of plausible, but unsupported content. We propose Precise Information Control (PIC): a task requiring LMs to ground only on given verifiable claims.

thumb_up_off_alt43

chat_bubble_outline1

repeat18

shareShare

Stella Li

@stellalisy

3 months ago

Spurious Rewards was not all‼️We now present spurious PROMPTS🤔 check out our latest findings and discussion on evaluation: tinyurl.com/spurious-prompt. Who knew Lorem ipsum can bring 19.4% gains compared to default prompt👀 Also, arXiv is out🤩 arxiv.org/abs/2506.10947📄

thumb_up_off_alt182

chat_bubble_outline6

repeat26

shareShare