AllenNLP (@ai2_allennlp)'s Twitter Profile
AllenNLP

@ai2_allennlp

The AllenNLP team works on language-centered AI that equitably serves humanity. We deliver high-impact research and open-source tools to accelerate progress.

ID: 1026903001431138304

Link: https://allenai.org/allennlp · Joined: 07-08-2018 18:48:49

264 Tweets

14.14K Followers

38 Following

Hamish Ivison (@hamishivi)'s Twitter Profile Photo

How well do data-selection methods work for instruction-tuning at scale?

Turns out, when you look at large, varied data pools, lots of recent methods lag behind simple baselines, and a simple embedding-based method (RDS) does best!

More below ⬇️ (1/8)
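The thread itself doesn't include code, but the core of an RDS-style embedding-based selector is easy to sketch. Below is a hypothetical illustration (function names and shapes are my own, not the paper's): score each candidate in the pool by cosine similarity between its embedding and the embeddings of target-task examples, then keep the top-k.

```python
import numpy as np

def select_top_k(pool_embs: np.ndarray, query_embs: np.ndarray, k: int) -> np.ndarray:
    """Score each pool example by its max cosine similarity to any
    query (target-task) embedding, then keep the top-k indices."""
    # Normalize rows so plain dot products equal cosine similarities.
    pool = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    query = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    sims = pool @ query.T               # (n_pool, n_query) cosine matrix
    scores = sims.max(axis=1)           # best match per pool example
    return np.argsort(-scores)[:k]      # indices of the k highest scores

# Toy usage: 5 pool examples, 2 query examples, 3-dim embeddings.
rng = np.random.default_rng(0)
pool = rng.normal(size=(5, 3))
query = rng.normal(size=(2, 3))
chosen = select_top_k(pool, query, k=2)
print(chosen)
```

In the real setting the embeddings would come from a trained model's hidden states and the pool would be millions of instruction-tuning examples; the selection logic stays this simple.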
Nathan Lambert (@natolambert)'s Twitter Profile Photo

A very exciting day for open-source AI! We're releasing our biggest open source model yet -- OLMo 2 32B -- and it beats the latest GPT 3.5, GPT 4o mini, and leading open weight models like Qwen and Mistral. As usual, all data, weights, code, etc. are available.

For a long time,
Nathan Lambert (@natolambert)'s Twitter Profile Photo

My teammates Costa Huang and Hamish Ivison have uploaded intermediate checkpoints for our recent RL models at Ai2. Hopefully this helps seed some research into how RL finetuning is impacting the weights! As we move towards full reasoner models we'll continue this. Models with it: OLMo

Ai2 (@allen_ai)'s Twitter Profile Photo

We submitted a recommendation to the Office of Science and Technology Policy encouraging them to prioritize a multi-stakeholder, open-source AI ecosystem. You can read our blog post and comment here: allenai.org/blog/OSTP

Nathan Lambert (@natolambert)'s Twitter Profile Photo

very fun to play with if you're an llm nerd -- something only folks in leading labs have really gotten to do over the last few years. Now you can look at the data that could have contributed to a completion.

Ai2 (@allen_ai)'s Twitter Profile Photo

Ever wonder how LLM developers choose their pretraining data? It’s not guesswork— all AI labs create small-scale models as experiments, but the models and their data are rarely shared.
DataDecide opens up the process: 1,050 models, 30k checkpoints, 25 datasets & 10 benchmarks 🧵
Nathan Lambert (@natolambert)'s Twitter Profile Photo

Heading to NAACL? With "verification being the key to AI" you should go to the poster session Friday, 9-10:30am to chat with my star colleagues Valentina Pyatkin + Jacob Morrison about RewardBench (and really RewardBench 2, evaluation, and reward models in post-training).
Nathan Lambert (@natolambert)'s Twitter Profile Photo

Stoked to get the 1B OLMo 2 model out -- this will likely be our most used model. Getting a 1B model we were happy with was a wandering path. I've written some lessons from training it below.

Astute followers of AI releases should be a bit confused by why we are releasing a 1B
Luca Soldaini ✈️ ICLR 25 (@soldni)'s Twitter Profile Photo

OLMo 2 model family is complete! Capping it off with a very strong 1B model... perfect baseline for your next posttrain paper 😁

Costa Huang (@vwxyzjn)'s Twitter Profile Photo

🥘 Excited to share our latest OLMo 1B models! Almost summer RL time. We did another two-stage RL:
* The first RLVR run uses allenai/RLVR-GSM-MATH-IF-Mixed-Constraints
* The final RLVR run uses huggingface.co/datasets/allen… for targeted MATH improvement

Short 🧵
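For context on what an RLVR ("RL with verifiable rewards") run checks, here is a minimal, hypothetical GSM8K/MATH-style verifier: a binary reward that passes only if the last number in the completion matches the gold answer. This is an illustration of the idea, not the actual open-instruct implementation.

```python
import re

def math_verifier(completion: str, gold_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the last number in the model's
    completion matches the gold answer, else 0.0 (GSM8K-style check)."""
    # Strip thousands separators, then grab every int/decimal in the text.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == gold_answer else 0.0

print(math_verifier("... so the total is 42.", "42"))   # 1.0
print(math_verifier("I think the answer is 7.", "42"))  # 0.0
```

Because the reward is computed by a program rather than a learned reward model, it can't be gamed by stylistic tricks, which is the appeal of RLVR for math and instruction-following constraints.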
Ai2 (@allen_ai)'s Twitter Profile Photo

The story of OLMo, our Open Language Model, goes back to February 2023 when a group of researchers gathered at Ai2 and started planning. What if we made a language model with state-of-the-art performance, but we did it completely in the open? 🧵

Nathan Lambert (@natolambert)'s Twitter Profile Photo

Super excited that our second reward model evaluation is out. It's substantially harder, much cleaner, and well correlated with downstream PPO/BoN sampling. Happy hillclimbing! Huge congrats to Saumya Malik, who led the project with a total commitment to excellence.

Ai2 (@allen_ai)'s Twitter Profile Photo

As we’ve been working towards training a new version of OLMo, we wanted to improve our methods for measuring the Critical Batch Size (CBS) of a training run to unlock greater efficiency, but we found gaps between the methods in the literature and our practical needs for training
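The tweet doesn't spell out which literature methods fell short, but one common approach to estimating the critical batch size is the gradient noise scale of McCandlish et al. (2018), which can be estimated from squared gradient norms measured at two batch sizes. A minimal sketch, with my own variable names and toy numbers (not Ai2's method):

```python
def gradient_noise_scale(g2_small: float, g2_big: float,
                         b_small: int, b_big: int) -> float:
    """Estimate the 'simple noise scale' B_noise = tr(Σ) / |G|² from
    squared gradient norms measured at two batch sizes, using
    E[|G_B|²] = |G|² + tr(Σ)/B. B_noise is a common proxy for the
    critical batch size."""
    # Solve the two-measurement system for the true |G|² and tr(Σ).
    g2_true = (b_big * g2_big - b_small * g2_small) / (b_big - b_small)
    trace_sigma = (g2_small - g2_big) / (1.0 / b_small - 1.0 / b_big)
    return trace_sigma / g2_true

# Toy numbers: gradients are noisier at the small batch size.
print(gradient_noise_scale(g2_small=4.0, g2_big=1.3, b_small=32, b_big=512))
```

In practice the two norms are averaged over many steps, which is one of the places where textbook recipes and real training runs tend to diverge.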
Jiacheng Liu (@liujc1998)'s Twitter Profile Photo

We enabled OLMoTrace for Tülu 3 models! 🤠

Matched spans are shorter than for OLMo models, bc we can only search in Tülu's post-training data (base model is Llama). Yet we thought it'd still bring some value.

Try it yourself on the Ai2 playground -- playground.allenai.org
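For intuition, OLMoTrace-style span matching can be caricatured in a few lines: greedily extend each position of the model output into the longest span that appears verbatim in the training text. This toy scan ignores the real system's index over trillions of tokens and token-boundary details; all names here are illustrative.

```python
def longest_matched_spans(output_tokens, corpus_tokens, min_len=3):
    """Find maximal spans of the model output that occur verbatim in a
    training corpus. A toy stand-in for span matching: the real system
    queries an index over trillions of tokens rather than scanning."""
    corpus = " ".join(corpus_tokens)
    spans, i = [], 0
    while i < len(output_tokens):
        # Greedily extend the span starting at i for as long as it
        # still occurs verbatim in the corpus text.
        j = i
        while j < len(output_tokens) and " ".join(output_tokens[i:j + 1]) in corpus:
            j += 1
        if j - i >= min_len:
            spans.append((i, j))  # half-open [i, j)
            i = j
        else:
            i += 1
    return spans

out = "the cat sat on the mat today".split()
corp = "yesterday the cat sat on the mat".split()
print(longest_matched_spans(out, corp))  # [(0, 6)]
```

This also shows why spans shrink when the searchable corpus shrinks: with only post-training data to match against (as for Tülu), fewer long verbatim spans exist to be found.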
Ai2 (@allen_ai)'s Twitter Profile Photo

New updates for olmOCR, our fully open toolkit for transforming documents (PDFs & images) into clean markdown. We released:

1️⃣ New benchmark for fair comparison of OCR engines and APIs
2️⃣ Improved inference that is faster and cheaper to run
3️⃣ Docker image for easy deployment
Nouha Dziri (@nouhadziri)'s Twitter Profile Photo

📢 Can LLMs really reason outside the box in math? Or are they just remixing familiar strategies? 

Remember how DeepSeek R1 and o1 impressed us on Olympiad-level math, yet failed at simple arithmetic 😬

We built a benchmark to find out → OMEGA Ω 📐

💥 We found
Ai2 (@allen_ai)'s Twitter Profile Photo

Introducing IFBench, a benchmark to measure how well AI models follow new, challenging, and diverse verifiable instructions. Top models like Gemini 2.5 Pro or Claude 4 Sonnet are only able to score up to 50%, presenting an open frontier for post-training. 🧵
Valentina Pyatkin (@valentina__py)'s Twitter Profile Photo

💡Beyond math/code, instruction following with verifiable constraints is suitable to be learned with RLVR.
But the set of constraints and verifier functions is limited and most models overfit on IFEval.
We introduce IFBench to measure model generalization to unseen constraints.
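Verifier functions of the kind IFBench measures are ordinary programmatic checks on the model's response. Two hypothetical examples (the function names and exact constraints are illustrative, not necessarily the benchmark's):

```python
def verify_word_count(response: str, max_words: int) -> bool:
    """Verifier for a constraint like 'answer in at most max_words
    words'. IFEval/IFBench-style checks are programmatic pass/fail
    functions like this, usable directly as RLVR rewards."""
    return len(response.split()) <= max_words

def verify_keyword(response: str, keyword: str, times: int) -> bool:
    """Verifier for 'use the word `keyword` at least `times` times'."""
    return response.lower().split().count(keyword.lower()) >= times

print(verify_word_count("short and sweet", 5))         # True
print(verify_keyword("olmo olmo is open", "olmo", 2))  # True
```

Because each verifier is deterministic code, a model can't partially satisfy a constraint; generalizing to constraints whose verifiers were never trained on is what the benchmark probes.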
Ai2 (@allen_ai)'s Twitter Profile Photo

Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵