Hamish Ivison (@hamishivi) 's Twitter Profile
Hamish Ivison

@hamishivi

Antipodean Abroad. I (try to) do NLP research.
PhD student @uwcse,
prev @Sydney_Uni @allen_ai
🇦🇺🇨🇦🇬🇧

ID: 713618528578895872

linkhttp://hamishivi.github.io calendar_today26-03-2016 06:48:10

322 Tweet

1,1K Followers

672 Following

Alisa Liu (@alisawuffles) 's Twitter Profile Photo

We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words. When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵

We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words.

When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵
Ai2 (@allen_ai) 's Twitter Profile Photo

For years it’s been an open question — how much is a language model learning and synthesizing information, and how much is it just memorizing and reciting? Introducing OLMoTrace, a new feature in the Ai2 Playground that begins to shed some light. 🔦

Kabir (@kabirahuja004) 's Twitter Profile Photo

📢 New Paper! Tired 😴 of reasoning benchmarks full of math & code? In our work we consider the problem of reasoning for plot holes in stories -- inconsistencies in a storyline that break the internal logic or rules of a story’s world 🌎 W/ Melanie Sclar, and tsvetshop 1/n

📢 New Paper!

Tired 😴 of reasoning benchmarks full of math & code? In our work we consider the problem of reasoning for plot holes in stories -- inconsistencies in a storyline that break the internal logic or rules of a story’s world 🌎

W/ <a href="/melaniesclar/">Melanie Sclar</a>, and <a href="/tsvetshop/">tsvetshop</a>

1/n
Hamish Ivison (@hamishivi) 's Twitter Profile Photo

My favourite recently is still sometimes when training a model to do multiplication it just.... repeats the question in a sympy-friendly way, since sympy performs the multiplication for it when parsing the answer. Emergent tool-use? 😅

My favourite recently is still sometimes when training a model to do multiplication it just.... repeats the question in a sympy-friendly way, since sympy performs the multiplication for it when parsing the answer.

Emergent tool-use? 😅
Ai2 (@allen_ai) 's Twitter Profile Photo

We’re live on Reddit! Ask us Anything about our OLMo family of models. We have six of our researchers on hand to answer all your questions.

We’re live on Reddit! Ask us Anything about our OLMo family of models. We have six of our researchers on hand to answer all your questions.
Rui Xin (@rui_xin31) 's Twitter Profile Photo

Think PII scrubbing ensures privacy? 🤔Think again‼️ In our paper, for the first time on unstructured text, we show that you can re-identify over 70% of private information *after* scrubbing! It’s time to move beyond surface-level anonymization. #Privacy #NLProc 🔗🧵

Think PII scrubbing ensures privacy? 🤔Think again‼️ In our paper, for the first time on unstructured text, we show that you can re-identify over 70% of private information *after* scrubbing! It’s time to move beyond surface-level anonymization. #Privacy #NLProc 🔗🧵
Tong Chen @ ICLR (@tomchen0) 's Twitter Profile Photo

LLMs naturally memorize some verbatim of pre-training data. We study whether post-training can be an effective way to mitigate unintentional reproduction of pre-training data. 🛠️ No changes to pre-training or decoding 🔥 Training models to latently distinguish between memorized

LLMs naturally memorize some verbatim of pre-training data. We study whether post-training can be an effective way to mitigate unintentional reproduction of pre-training data.
🛠️ No changes to pre-training or decoding
🔥 Training models to latently distinguish between memorized
Hamish Ivison (@hamishivi) 's Twitter Profile Photo

Learnt a lot from working with Costa! Truly a GOAT of OSS RL infra 🫡 sad to see him go but excited for what's next 😁

Hamish Ivison (@hamishivi) 's Twitter Profile Photo

Was fun to see this come together! Personally, i think it highlights the fact that we should evaluate rlvr findings across multiple model families (maybe including olmo wink wink nudge nudge?)

Shan Chen (@shan23chen) 's Twitter Profile Photo

‼️ 1/n Ask your reasoning model to think in lower resource language does degrade models’ performance at the moment. My awesome Co-author already communicated the main points in the thread, I will just communicate some random things we learned in my 🧵

Ai2 (@allen_ai) 's Twitter Profile Photo

RewardBench 2 is here! We took a long time to learn from our first reward model evaluation tool to make one that is substantially harder and more correlated with both downstream RLHF and inference-time scaling.

RewardBench 2 is here! We took a long time to learn from our first reward model evaluation tool to make one that is substantially harder and more correlated with both downstream RLHF and inference-time scaling.
Jacqueline He (@jcqln_h) 's Twitter Profile Photo

LMs often output answers that sound right but aren’t supported by input context. This is intrinsic hallucination: the generation of plausible, but unsupported content. We propose Precise Information Control (PIC): a task requiring LMs to ground only on given verifiable claims.

LMs often output answers that sound right but aren’t supported by input context. This is intrinsic hallucination: the generation of plausible, but unsupported content.

We propose Precise Information Control (PIC): a task requiring LMs to ground only on given verifiable claims.
Stella Li (@stellalisy) 's Twitter Profile Photo

Spurious Rewards was not all‼️We now present spurious PROMPTS🤔 check out our latest findings and discussion on evaluation: tinyurl.com/spurious-prompt. Who knew Lorem ipsum can bring 19.4% gains compared to default prompt👀 Also, arXiv is out🤩 arxiv.org/abs/2506.10947📄

Spurious Rewards was not all‼️We now present spurious PROMPTS🤔 check out our latest findings and discussion on evaluation: tinyurl.com/spurious-prompt.

Who knew Lorem ipsum can bring 19.4% gains compared to default prompt👀

Also, arXiv is out🤩 arxiv.org/abs/2506.10947📄