Quentin Lhoest 🤗 (@lhoestq) Twitter Tweets • TwiCopy

Reasoning benchmarks (e.g., MMLU Pro and GPQA) have seen little benefit from naive RAG. But can we flip this?
🔥 Introducing CompactDS:
✅ Web-scale coverage
✅ Runs with just 100GB RAM
✅ Matches search engines
The simplest RAG pipeline can even compete with agentic

52 likes · 1 reply · 16 reposts

Christopher McMaster

@rheum_ai

a month ago

I have an alternative proposal.

312 likes · 15 replies · 36 reposts

Daniel van Strien

@vanstriendaniel

a month ago

465 people. 122 languages. 58,185 annotations! FineWeb-C v1 is complete! Communities worldwide have built their own educational quality datasets, proving that we don't need to wait for big tech to support languages. Huge thanks to all who contributed! huggingface.co/blog/davanstri…

111 likes · 4 replies · 29 reposts

Leandro von Werra

@lvwerra

a month ago

SmolLM3 is out - a smol, long context, multilingual reasoner! Along with the model we release the full engineering blueprint for building SoTA LLMs at that scale, so everybody can build on it! Check it out: hf.co/blog/smollm3

177 likes · 5 replies · 24 reposts

clem 🤗

@clementdelangue

a month ago

Opening orders for Reachy Mini today, our open-source desktop robot for AI builders, starting at $299! Fully integrated with LeRobot & Hugging Face for the whole community to build AI apps for it (like this dancing one). We'll probably ship a first batch of a hundred this

578 likes · 50 replies · 90 reposts

Loubna Ben Allal

@loubnabenallal1

a month ago

Introducing SmolTalk2: the dataset behind SmolLM3's dual reasoning.
- mid-training → 5M samples
- SFT data → 3M samples
- preferences for APO → 500k samples
It combines open datasets with new ones curated for strong think and no_think performance. hf.co/datasets/Huggi…

268 likes · 5 replies · 45 reposts

Guilherme Penedo

@gui_penedo

23 days ago

Update: 🍷FineWeb and 📚 FineWeb-Edu now include English data from this year's CommonCrawl snapshots, covering Jan-Jun 2025. 🍷FineWeb now has 18.5 trillion tokens. We'll keep publishing timely updates to ensure your models have the latest world knowledge.

163 likes · 3 replies · 17 reposts

Lukas Thede

@lukas_thede

22 days ago

🚨 Poster at #ICML2025! How can LLMs really keep up with the world? Come by E-2405 on July 15th (4:30–7:00pm) to check out WikiBigEdit – our new benchmark to test lifelong knowledge editing in LLMs at scale. 🔗 Real-world updates 📈 500k+ QA edits 🧠 Editing vs. RAG vs. CL

9 likes · 1 reply · 2 reposts

Apache Spark

@apachespark

22 days ago

📣 Announcing: Apache Spark™ Python Data Source for Hugging Face AI Datasets! During this virtual event, you’ll learn how the Apache Spark™ 4.x Python Data Source API allows Hugging Face to extend datasets for AI workloads. Why attend? ✅ Discover the latest

21 likes · 2 replies · 5 reposts

Li Lyna Zhang

@lynazhang

22 days ago

🚀Our rStar-Coder dataset is now released! A verified dataset of 418K competition-level code problems, each with test cases of varying difficulty. On LiveCodeBench, it boosts Qwen2.5-14B from 23.3% → 62.5%, beating o3-mini (low) by +3.1%. Try it here: huggingface.co/datasets/micro…

208 likes · 4 replies · 41 reposts