
Guilherme Penedo
@gui_penedo
Pre-training data @huggingface 🤗. Lisboeta 🇵🇹




Finally had a bit of time to jot down some thoughts on this solid, open data engineering work from Essential AI. This work brings Essential-Web, a 24T-token pre-training corpus, to the open-source community. I've always appreciated open-source research, as it can significantly
