Alon Albalak (@albalakalon)'s Twitter Profile
Alon Albalak

@albalakalon

Driving open-science and data-centric AI research @synth_labs

Previously: PhD @ucsbNLP, internships @MSFTResearch & @AIatMeta.

All puns are my own

ID: 1333847197570408448

Link: http://alon-albalak.github.io · Joined: 01-12-2020 18:55:37

844 Tweets

1.1K Followers

572 Following

EleutherAI (@aieleuther) 's Twitter Profile Photo

Can you train a performant language model without using unlicensed text?

We are thrilled to announce the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text. We train 7B models for 1T and 2T tokens and match the performance of similar models like LLaMA 1 & 2.
Alon Albalak (@albalakalon) 's Twitter Profile Photo

This is a big step towards openness, congrats to everyone 🎉 I'm happy to have had the opportunity to contribute to such a meaningful project! A bit disappointed that the WaPo article is behind a paywall though 😢 washingtonpost.com/politics/2025/…

Stella Biderman (@blancheminerva) 's Twitter Profile Photo

Two years in the making, we finally have 8 TB of openly licensed data with document-level metadata for authorship attribution, licensing details, links to original copies, and more. Hugely proud of the entire team.

Alfonso Amayuelas (@alfonamayuelas) 's Twitter Profile Photo

New paper 🚨📜🚀
Introducing “Agents of Change: Self-Evolving LLM Agents for Strategic Planning”!
In this work, we show how LLM-powered agents can rewrite their own prompts & code to climb the learning curve in the board game Settlers of Catan 🎲
🧵👇
Shayne Longpre (@shayneredford) 's Twitter Profile Photo

Thrilled to collaborate on the launch of 📚 Common Pile v0.1 📚!

Introducing the largest openly-licensed LLM pretraining corpus (8 TB), led by Nikhil Kandpal, Brian Lester, and Colin Raffel.

📜: arxiv.org/pdf/2506.05209
📚🤖 Data & models: huggingface.co/common-pile
1/
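
For readers who want to poke at the corpus, here is a minimal sketch of how one might stream a few documents from the Hugging Face collection linked above. The subset name ("common-pile/wikimedia") and the "text" field are assumptions for illustration only; check huggingface.co/common-pile for the actual dataset identifiers and schemas.

# Minimal sketch: stream a few records from one Common Pile subset.
# NOTE: the repository/subset name below is an assumption -- browse
# huggingface.co/common-pile for the real dataset identifiers.
from datasets import load_dataset

ds = load_dataset("common-pile/wikimedia", split="train", streaming=True)

for i, record in enumerate(ds):
    # Assumes each record carries its document under a "text" key.
    print(record.get("text", "")[:200])
    if i >= 2:
        break
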
Alon Albalak (@albalakalon) 's Twitter Profile Photo

Benchmarks like this are an incredibly important aspect of continued progress in AI research, but it's SO hard to come up with good benchmarks! Great work Minqi Jiang and team 🙌🎉

Alon Albalak (@albalakalon) 's Twitter Profile Photo

Wow, I'm so glad people are doing research like this! Main takeaway: 💠 LLM-generated research ideas **sound** good, but when the ideas are implemented, they don't work out as well as human-generated ideas. Still lots of work to do on hypothesis generation for LLMs 🧑‍🔬

Machine Learning Street Talk (@mlstreettalk) 's Twitter Profile Photo

If AI is so smart, why are its internals 'spaghetti'? We spoke with Kenneth Stanley and Akarsh Kumar (MIT) about their new paper: Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis. Co-authors: Jeff Clune, Joel Lehman

Alon Albalak (@albalakalon) 's Twitter Profile Photo

Awesome work by Weijia Shi, Sewon Min, and the rest of the team! I love this direction of research 😍 moving towards collaborative model development and deployment!

Richard Socher (@richardsocher) 's Twitter Profile Photo

Everyone's chasing AI talent from the obvious places. But here's the thing: you can't build another OpenAI by just building an LLM. If you want to replicate OpenAI's success, maybe don't just go after the latest employees from top labs. Go a step further and look at who trained

Ai2 (@allen_ai) 's Twitter Profile Photo

Great science starts with great questions. 🤔✨ Meet AutoDS—an AI that doesn’t just hunt for answers, it decides which questions are worth asking. 🧵
Joel Simon (@_joelsimon) 's Twitter Profile Photo

Excited to share that I'm joining Ken's lab at Lila to research open-endedness and explore the future of human/agent collaborative systems for science, creativity and discovery! 🧪🤖

Richard Suwandi @ICLR2025 (@richardcsuwandi) 's Twitter Profile Photo

We’re training AI on everything that we know, but what about things that we don’t know? 

At #ICML2025, the EXAIT Workshop sparked a crucial conversation: as AI systems grow more powerful, they're relying less on genuine exploration and more on curated human data. This shortcut