Quentin Lhoest 🤗 (@lhoestq)'s Twitter Profile
Quentin Lhoest 🤗

@lhoestq

Datasets @huggingface | Open Source + HF Dataset Hub

ID: 1846655912

Link: https://huggingface.co/lhoestq
Joined: 09-09-2013 14:26:34

1.1K Tweets

3.3K Followers

275 Following

Xinxi Lyu (@xinxilyu)

Reasoning benchmarks (e.g., MMLU Pro and GPQA) have seen little benefit from naive RAG. But can we flip this? 🔥Introducing CompactDS: ✅Web-scale coverage ✅Runs with just 100GB RAM ✅Matches search engines The simplest RAG pipeline can even compete with agentic

Daniel van Strien (@vanstriendaniel)

465 people. 122 languages. 58,185 annotations! FineWeb-C v1 is complete! Communities worldwide have built their own educational quality datasets, proving that we don't need to wait for big tech to support languages. Huge thanks to all who contributed! huggingface.co/blog/davanstri…

Leandro von Werra (@lvwerra)

SmolLM3 is out - a smol, long context, multilingual reasoner! 

Along with the model we release the full engineering blueprint for building SoTA LLMs at that scale, so everybody can build on it!

Check it out: hf.co/blog/smollm3

clem 🤗 (@clementdelangue)

Opening orders for Reachy Mini today, our open-source desktop robot for AI builders, starting at $299! Fully integrated with LeRobot & Hugging Face for the whole community to build AI apps for it (like this dancing one). We'll probably ship a first batch of a hundred this

Loubna Ben Allal (@loubnabenallal1)

Introducing SmolTalk2: the dataset behind SmolLM3's dual reasoning.

- mid-training → 5M samples
- SFT data → 3M samples
- preferences for APO → 500k samples

It combines open datasets with new ones curated for strong think and no_think performance.
hf.co/datasets/Huggi…

Guilherme Penedo (@gui_penedo)

Update: 🍷FineWeb and 📚 FineWeb-Edu now include English data from this year's CommonCrawl snapshots, covering Jan-Jun 2025.

🍷FineWeb now has 18.5 trillion tokens.

We'll keep publishing timely updates to ensure your models have the latest world knowledge.

Lukas Thede (@lukas_thede)

🚨 Poster at #ICML2025!
How can LLMs really keep up with the world?

Come by E-2405 on July 15th (4:30–7:00pm) to check out WikiBigEdit – our new benchmark to test lifelong knowledge editing in LLMs at scale.

🔗 Real-world updates
📈 500k+ QA edits
🧠 Editing vs. RAG vs. CL

Apache Spark (@apachespark)

📣 Announcing: Apache Spark™ Python Data Source for Hugging Face AI Datasets!

During this virtual event, you’ll learn how Apache Spark™ 4.x Python Data Source API allows Hugging Face to extend datasets for AI workloads.

Why attend?
✅ 𝗗𝗶𝘀𝗰𝗼𝘃𝗲𝗿 𝘁𝗵𝗲 𝗹𝗮𝘁𝗲𝘀𝘁

Li Lyna Zhang (@lynazhang)

🚀Our rStar-Coder dataset is now released! A verified dataset of 418K competition-level code problems, each with test cases of varying difficulty. On LiveCodeBench, it boosts Qwen2.5-14B from 23.3% → 62.5%, beating o3-mini (low) by +3.1%. Try it here: huggingface.co/datasets/micro…