Alon Albalak (@albalakalon)'s Twitter Profile
Alon Albalak

@albalakalon

Driving open-science and data-centric AI research @synth_labs

Previously: PhD @ucsbNLP, internships @MSFTResearch & @AIatMeta.

All puns are my own

ID: 1333847197570408448

Link: http://alon-albalak.github.io · Joined: 01-12-2020 18:55:37

844 Tweets

1.1K Followers

572 Following

EleutherAI (@aieleuther) 's Twitter Profile Photo

Can you train a performant language model without using unlicensed text?

We are thrilled to announce the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text. We train 7B models for 1T and 2T tokens and match the performance of similar models like LLaMA 1 & 2.
Alon Albalak (@albalakalon) 's Twitter Profile Photo

This is a big step towards openness, congrats to everyone 🎉 I'm happy to have had the opportunity to contribute to such a meaningful project! A bit disappointed that the WaPo article is behind a paywall though 😢 washingtonpost.com/politics/2025/…

Stella Biderman (@blancheminerva) 's Twitter Profile Photo

Two years in the making, we finally have 8 TB of openly licensed data with document-level metadata for authorship attribution, licensing details, links to original copies, and more. Hugely proud of the entire team.

Alfonso Amayuelas (@alfonamayuelas) 's Twitter Profile Photo

New paper 🚨📜🚀
Introducing “Agents of Change: Self-Evolving LLM Agents for Strategic Planning”!
In this work, we show how LLM-powered agents can rewrite their own prompts & code to climb the learning curve in the board game Settlers of Catan 🎲
🧵👇
Shayne Longpre (@shayneredford) 's Twitter Profile Photo

Thrilled to collaborate on the launch of 📚 Common Pile v0.1 📚!

Introducing the largest openly-licensed LLM pretraining corpus (8 TB), led by Nikhil Kandpal, Brian Lester, and Colin Raffel.

📜: arxiv.org/pdf/2506.05209
📚🤖 Data & models: huggingface.co/common-pile
1/
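
For readers who want to poke at the corpus, here is a minimal sketch of how one might stream a few documents from the Hugging Face collection linked above. The subset name ("common-pile/wikimedia") and the "text" field are assumptions for illustration only; check huggingface.co/common-pile for the actual dataset identifiers and schemas.

# Minimal sketch: stream a few records from one Common Pile subset.
# NOTE: the repository/subset name below is an assumption -- browse
# huggingface.co/common-pile for the real dataset identifiers.
from datasets import load_dataset

ds = load_dataset("common-pile/wikimedia", split="train", streaming=True)

for i, record in enumerate(ds):
    # Assumes each record carries its document under a "text" key.
    print(record.get("text", "")[:200])
    if i >= 2:
        break
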
Alon Albalak (@albalakalon) 's Twitter Profile Photo

Benchmarks like this are an incredibly important aspect of continued progress in AI research, but it's SO hard to come up with good benchmarks! Great work Minqi Jiang and team 🙌🎉

Alon Albalak (@albalakalon) 's Twitter Profile Photo

Wow, I'm so glad people are doing research like this! Main takeaway: 💠 LLM-generated research ideas **sound** good, but when the ideas are implemented, they don't work out as well as human-generated ideas. Still lots of work to do on hypothesis generation for LLMs 🧑‍🔬

Machine Learning Street Talk (@mlstreettalk) 's Twitter Profile Photo

If AI is so smart, why are its internals 'spaghetti'? We spoke with Kenneth Stanley and Akarsh Kumar (MIT) about their new paper: Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis. Co-authors: Jeff Clune, Joel Lehman

Alon Albalak (@albalakalon) 's Twitter Profile Photo

Awesome work by Weijia Shi, Sewon Min, and the rest of the team! I love this direction of research 😍 moving towards collaborative model development and deployment!

Richard Socher (@richardsocher) 's Twitter Profile Photo

Everyone's chasing AI talent from the obvious places. But here's the thing: you can't build another OpenAI by just building an LLM. If you want to replicate OpenAI's success, maybe don't just go after the latest employees from top labs. Go a step further and look at who trained

Ai2 (@allen_ai) 's Twitter Profile Photo

Great science starts with great questions. 🤔✨ Meet AutoDS—an AI that doesn’t just hunt for answers, it decides which questions are worth asking. 🧵
Joel Simon (@_joelsimon) 's Twitter Profile Photo

Excited to share that I'm joining Ken's lab at Lila to research open-endedness and explore the future of human/agent collaborative systems for science, creativity and discovery! 🧪🤖

Richard Suwandi @ICLR2025 (@richardcsuwandi) 's Twitter Profile Photo

We’re training AI on everything that we know, but what about things that we don’t know? 

At #ICML2025, the EXAIT Workshop sparked a crucial conversation: as AI systems grow more powerful, they're relying less on genuine exploration and more on curated human data. This shortcut