Guilherme Penedo (@gui_penedo) 's Twitter Profile
Guilherme Penedo

@gui_penedo

Pre-training data @huggingface 🤗. Lisboeta 🇵🇹

ID: 547836893

Joined: 07-04-2012 19:07:52

914 Tweets

3.3K Followers

2.2K Following

Dana Aubakirova (@daubakirovaa) 's Twitter Profile Photo

Today, we are introducing SmolVLA: a 450M open-source vision-language-action model. Best-in-class performance and inference speed! And the best part? We trained it using all the open-source LeRobot datasets on the Hugging Face Hub! But how? 🫳🏀

EleutherAI (@aieleuther) 's Twitter Profile Photo

Can you train a performant language model without using unlicensed text? We are thrilled to announce the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text. We train 7B models for 1T and 2T tokens and match the performance of similar models like LLaMA 1 & 2.

Guilherme Penedo (@gui_penedo) 's Twitter Profile Photo

Very happy to have played a (very small) part in the release of this very large, fully open dataset. We finally have an answer to the question: "how good a model can we get with fully permissive data?" Turns out, not bad at all.
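A minimal sketch of how a corpus like this could be streamed from the Hugging Face Hub with the datasets library; the repo id and the "text" column below are assumed placeholders rather than the dataset's confirmed names:

    from datasets import load_dataset

    # streaming=True reads records lazily instead of downloading the full multi-terabyte dataset
    ds = load_dataset("common-pile/common-pile-v0.1",  # hypothetical repo id
                      split="train", streaming=True)

    # print a short preview of the first few documents
    for i, example in enumerate(ds):
        print(example["text"][:200])  # assumes a "text" column, typical for pre-training corpora
        if i >= 2:
            break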

Sinclair Wang (@sinclairwang1) 's Twitter Profile Photo

Finally had a bit of time to jot down some thoughts on this solid, open data engineering work from Essential AI. This work brings Essential-Web, a 24T-token pre-training corpus, to the open-source community. I've always appreciated open-source research, as it can significantly
