Guilherme Penedo (@gui_penedo) 's Twitter Profile
Guilherme Penedo

@gui_penedo

Pre-training data @huggingface 🤗. Lisboeta 🇵🇹

ID: 547836893

Joined: 07-04-2012 19:07:52

914 Tweets

3.3K Followers

2.2K Following

Dana Aubakirova (@daubakirovaa) 's Twitter Profile Photo

Today, we are introducing SmolVLA: a 450M open-source vision-language-action model. Best-in-class performance and inference speed! And the best part? We trained it using all the open-source LeRobot datasets on the Hugging Face Hub! But how? 🫳🏀

EleutherAI (@aieleuther) 's Twitter Profile Photo

Can you train a performant language model without using unlicensed text? We are thrilled to announce the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text. We train 7B models for 1T and 2T tokens and match the performance of similar models like LLaMA 1 & 2.

Guilherme Penedo (@gui_penedo) 's Twitter Profile Photo

Very happy to have played a (very small) part in the release of this very large, fully open dataset. We finally have an answer to the question: "how good a model can we get with fully permissive data?" Turns out, not bad at all.
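A minimal sketch of how a corpus like this could be streamed from the Hugging Face Hub with the datasets library; the repo id and the "text" column below are assumed placeholders rather than the dataset's confirmed names:

    from datasets import load_dataset

    # streaming=True reads records lazily instead of downloading the full multi-terabyte dataset
    ds = load_dataset("common-pile/common-pile-v0.1",  # hypothetical repo id
                      split="train", streaming=True)

    # print a short preview of the first few documents
    for i, example in enumerate(ds):
        print(example["text"][:200])  # assumes a "text" column, typical for pre-training corpora
        if i >= 2:
            break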

Sinclair Wang (@sinclairwang1) 's Twitter Profile Photo

Finally had a bit of time to jot down some thoughts on this solid, open data engineering work from Essential AI. This work brings Essential-Web, a 24T-token pre-training corpus, to the open-source community. I've always appreciated open-source research, as it can significantly
