AllenNLP (@ai2_allennlp)'s Twitter Profile
AllenNLP

@ai2_allennlp

The AllenNLP team works on language-centered AI that equitably serves humanity. We deliver high-impact research and open-source tools to accelerate progress.

ID: 1026903001431138304

Link: https://allenai.org/allennlp · Joined: 07-08-2018 18:48:49

264 Tweets

14.14K Followers

38 Following

Hamish Ivison (@hamishivi)'s Twitter Profile Photo

How well do data-selection methods work for instruction-tuning at scale?

Turns out, when you look at large, varied data pools, lots of recent methods lag behind simple baselines, and a simple embedding-based method (RDS) does best!

More below ⬇️ (1/8)
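The thread itself doesn't include code, but the core of an RDS-style embedding-based selector is easy to sketch. Below is a hypothetical illustration (function names and shapes are my own, not the paper's): score each candidate in the pool by cosine similarity between its embedding and the embeddings of target-task examples, then keep the top-k.

```python
import numpy as np

def select_top_k(pool_embs: np.ndarray, query_embs: np.ndarray, k: int) -> np.ndarray:
    """Score each pool example by its max cosine similarity to any
    query (target-task) embedding, then keep the top-k indices."""
    # Normalize rows so plain dot products equal cosine similarities.
    pool = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    query = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    sims = pool @ query.T               # (n_pool, n_query) cosine matrix
    scores = sims.max(axis=1)           # best match per pool example
    return np.argsort(-scores)[:k]      # indices of the k highest scores

# Toy usage: 5 pool examples, 2 query examples, 3-dim embeddings.
rng = np.random.default_rng(0)
pool = rng.normal(size=(5, 3))
query = rng.normal(size=(2, 3))
chosen = select_top_k(pool, query, k=2)
print(chosen)
```

In the real setting the embeddings would come from a trained model's hidden states and the pool would be millions of instruction-tuning examples; the selection logic stays this simple.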
Nathan Lambert (@natolambert)'s Twitter Profile Photo

A very exciting day for open-source AI! We're releasing our biggest open source model yet -- OLMo 2 32B -- and it beats the latest GPT 3.5, GPT 4o mini, and leading open weight models like Qwen and Mistral. As usual, all data, weights, code, etc. are available.

For a long time,
Nathan Lambert (@natolambert)'s Twitter Profile Photo

My teammates Costa Huang and Hamish Ivison have uploaded intermediate checkpoints for our recent RL models at Ai2. Hopefully this helps seed some research into how RL finetuning is impacting the weights! As we move towards full reasoner models we'll continue this. Models with it: OLMo

Ai2 (@allen_ai)'s Twitter Profile Photo

We submitted a recommendation to the Office of Science and Technology Policy encouraging them to prioritize a multi-stakeholder, open-source AI ecosystem. You can read our blog post and comment here: allenai.org/blog/OSTP

Nathan Lambert (@natolambert)'s Twitter Profile Photo

very fun to play with if you're an llm nerd -- something only folks in leading labs have really gotten to do over the last few years. Now you can look at the data that could have contributed to a completion.

Ai2 (@allen_ai)'s Twitter Profile Photo

Ever wonder how LLM developers choose their pretraining data? It’s not guesswork— all AI labs create small-scale models as experiments, but the models and their data are rarely shared.
DataDecide opens up the process: 1,050 models, 30k checkpoints, 25 datasets & 10 benchmarks 🧵
Nathan Lambert (@natolambert)'s Twitter Profile Photo

Heading to NAACL? With "verification being the key to AI" you should go to the poster session Friday, 9-10:30am to chat with my star colleagues Valentina Pyatkin + Jacob Morrison about RewardBench (and really RewardBench 2, evaluation, and reward models in post-training).
Nathan Lambert (@natolambert)'s Twitter Profile Photo

Stoked to get the 1B OLMo 2 model out -- this will likely be our most used model. Getting a 1B model we were happy with was a wandering path. I've written some lessons from training it below.

Astute followers of AI releases should be a bit confused by why we are releasing a 1B
Luca Soldaini ✈️ ICLR 25 (@soldni)'s Twitter Profile Photo

OLMo 2 model family is complete! Capping it off with a very strong 1B model... perfect baseline for your next posttrain paper 😁

Costa Huang (@vwxyzjn)'s Twitter Profile Photo

🥘 Excited to share our latest OLMo 1B models! Almost summer RL time. We did another two-stage RL:
* The first RLVR run uses allenai/RLVR-GSM-MATH-IF-Mixed-Constraints
* The final RLVR run uses huggingface.co/datasets/allen… for targeted MATH improvement

Short 🧵
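For context on what an RLVR ("RL with verifiable rewards") run checks, here is a minimal, hypothetical GSM8K/MATH-style verifier: a binary reward that passes only if the last number in the completion matches the gold answer. This is an illustration of the idea, not the actual open-instruct implementation.

```python
import re

def math_verifier(completion: str, gold_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the last number in the model's
    completion matches the gold answer, else 0.0 (GSM8K-style check)."""
    # Strip thousands separators, then grab every int/decimal in the text.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == gold_answer else 0.0

print(math_verifier("... so the total is 42.", "42"))   # 1.0
print(math_verifier("I think the answer is 7.", "42"))  # 0.0
```

Because the reward is computed by a program rather than a learned reward model, it can't be gamed by stylistic tricks, which is the appeal of RLVR for math and instruction-following constraints.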
Ai2 (@allen_ai)'s Twitter Profile Photo

The story of OLMo, our Open Language Model, goes back to February 2023 when a group of researchers gathered at Ai2 and started planning. What if we made a language model with state-of-the-art performance, but we did it completely in the open? 🧵

Nathan Lambert (@natolambert)'s Twitter Profile Photo

Super excited that our second reward model evaluation is out. It's substantially harder, much cleaner, and well correlated with downstream PPO/BoN sampling. Happy hillclimbing! Huge congrats to Saumya Malik, who led the project with a total commitment to excellence.

Ai2 (@allen_ai)'s Twitter Profile Photo

As we’ve been working towards training a new version of OLMo, we wanted to improve our methods for measuring the Critical Batch Size (CBS) of a training run to unlock greater efficiency, but we found gaps between the methods in the literature and our practical needs for training
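The tweet doesn't spell out which literature methods fell short, but one common approach to estimating the critical batch size is the gradient noise scale of McCandlish et al. (2018), which can be estimated from squared gradient norms measured at two batch sizes. A minimal sketch, with my own variable names and toy numbers (not Ai2's method):

```python
def gradient_noise_scale(g2_small: float, g2_big: float,
                         b_small: int, b_big: int) -> float:
    """Estimate the 'simple noise scale' B_noise = tr(Σ) / |G|² from
    squared gradient norms measured at two batch sizes, using
    E[|G_B|²] = |G|² + tr(Σ)/B. B_noise is a common proxy for the
    critical batch size."""
    # Solve the two-measurement system for the true |G|² and tr(Σ).
    g2_true = (b_big * g2_big - b_small * g2_small) / (b_big - b_small)
    trace_sigma = (g2_small - g2_big) / (1.0 / b_small - 1.0 / b_big)
    return trace_sigma / g2_true

# Toy numbers: gradients are noisier at the small batch size.
print(gradient_noise_scale(g2_small=4.0, g2_big=1.3, b_small=32, b_big=512))
```

In practice the two norms are averaged over many steps, which is one of the places where textbook recipes and real training runs tend to diverge.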
Jiacheng Liu (@liujc1998)'s Twitter Profile Photo

We enabled OLMoTrace for Tülu 3 models! 🤠

Matched spans are shorter than for OLMo models, bc we can only search in Tülu's post-training data (base model is Llama). Yet we thought it'd still bring some value.

Try it yourself on the Ai2 playground -- playground.allenai.org
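For intuition, OLMoTrace-style span matching can be caricatured in a few lines: greedily extend each position of the model output into the longest span that appears verbatim in the training text. This toy scan ignores the real system's index over trillions of tokens and token-boundary details; all names here are illustrative.

```python
def longest_matched_spans(output_tokens, corpus_tokens, min_len=3):
    """Find maximal spans of the model output that occur verbatim in a
    training corpus. A toy stand-in for span matching: the real system
    queries an index over trillions of tokens rather than scanning."""
    corpus = " ".join(corpus_tokens)
    spans, i = [], 0
    while i < len(output_tokens):
        # Greedily extend the span starting at i for as long as it
        # still occurs verbatim in the corpus text.
        j = i
        while j < len(output_tokens) and " ".join(output_tokens[i:j + 1]) in corpus:
            j += 1
        if j - i >= min_len:
            spans.append((i, j))  # half-open [i, j)
            i = j
        else:
            i += 1
    return spans

out = "the cat sat on the mat today".split()
corp = "yesterday the cat sat on the mat".split()
print(longest_matched_spans(out, corp))  # [(0, 6)]
```

This also shows why spans shrink when the searchable corpus shrinks: with only post-training data to match against (as for Tülu), fewer long verbatim spans exist to be found.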
Ai2 (@allen_ai)'s Twitter Profile Photo

New updates for olmOCR, our fully open toolkit for transforming documents (PDFs & images) into clean markdown. We released:

1️⃣ New benchmark for fair comparison of OCR engines and APIs
2️⃣ Improved inference that is faster and cheaper to run
3️⃣ Docker image for easy deployment
Nouha Dziri (@nouhadziri)'s Twitter Profile Photo

📢 Can LLMs really reason outside the box in math? Or are they just remixing familiar strategies? 

Remember how DeepSeek R1 and o1 impressed us on Olympiad-level math, yet failed at simple arithmetic 😬

We built a benchmark to find out → OMEGA Ω 📐

💥 We found
Ai2 (@allen_ai)'s Twitter Profile Photo

Introducing IFBench, a benchmark to measure how well AI models follow new, challenging, and diverse verifiable instructions. Top models like Gemini 2.5 Pro or Claude 4 Sonnet are only able to score up to 50%, presenting an open frontier for post-training. 🧵
Valentina Pyatkin (@valentina__py)'s Twitter Profile Photo

💡Beyond math/code, instruction following with verifiable constraints is suitable to be learned with RLVR.
But the set of constraints and verifier functions is limited and most models overfit on IFEval.
We introduce IFBench to measure model generalization to unseen constraints.
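Verifier functions of the kind IFBench measures are ordinary programmatic checks on the model's response. Two hypothetical examples (the function names and exact constraints are illustrative, not necessarily the benchmark's):

```python
def verify_word_count(response: str, max_words: int) -> bool:
    """Verifier for a constraint like 'answer in at most max_words
    words'. IFEval/IFBench-style checks are programmatic pass/fail
    functions like this, usable directly as RLVR rewards."""
    return len(response.split()) <= max_words

def verify_keyword(response: str, keyword: str, times: int) -> bool:
    """Verifier for 'use the word `keyword` at least `times` times'."""
    return response.lower().split().count(keyword.lower()) >= times

print(verify_word_count("short and sweet", 5))         # True
print(verify_keyword("olmo olmo is open", "olmo", 2))  # True
```

Because each verifier is deterministic code, a model can't partially satisfy a constraint; generalizing to constraints whose verifiers were never trained on is what the benchmark probes.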
Ai2 (@allen_ai)'s Twitter Profile Photo

Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵