Clémentine Fourrier 🍊 (@clefourrier) 's Twitter Profile
Clémentine Fourrier 🍊

@clefourrier

Evals @HuggingFace 🐍✨

"The future is already here, it’s just not very evenly distributed" (Gibson)

ID: 1188812448767336449

Link: http://clefourrier.github.io

Joined: 28-10-2019 13:39:51

3.3K Tweets

5.5K Followers

378 Following

Maziyar PANAHI (@maziyarpanahi)

🚀 Big news in healthcare AI! I'm thrilled to announce the launch of OpenMed on Hugging Face, releasing 380+ state-of-the-art medical NER models for free under Apache 2.0.

And this is just the beginning! 🧵
Adrien Carreira (@xcid_)

Starting today you can run any of the 100K+ GGUFs on Hugging Face directly with Docker Run!

All of it one single line: docker model run hf.co/bartowski/Llam…

Excited to see how y'all will use it
Clémentine Fourrier 🍊 (@clefourrier)

Can LLMs predict the future?

In FutureBench, friends from Together AI create new questions from evolving news & markets:
As time passes, we'll see which agents are the best at predicting events that have yet to happen! 🔮

Also cool: by design, dynamic & uncontaminated eval
Together AI (@togethercompute)

Most AI benchmarks test the past.

But real intelligence is about predicting the future.

Introducing FutureBench: a new benchmark for evaluating agents on real forecasting tasks that we developed with Hugging Face

🔍 Reasoning > memorization
📊 Real-world events
🧠 Dynamic,
steven (@tu7uruu)

Just dropped on the Open ASR Leaderboard: Canary-Qwen-2.5, the latest and first-of-its-kind ASR model from the NVIDIA NeMo team.

> Ranked #1 on the Open ASR Leaderboard with a WER of just 5.63%
> Blazing fast with RTFx=418 on an A100 GPU for a 2.5b model!
> Released under a
ARC Prize (@arcprize)

Today, we're announcing a preview of ARC-AGI-3, the Interactive Reasoning Benchmark with the widest gap between easy for humans and hard for AI

We’re releasing:
* 3 games (environments)
* $10K agent contest
* AI agents API

Starting scores - Frontier AI: 0%, Humans: 100%
ARC Prize (@arcprize)

ARC-AGI-3 Preview games need to be pressure tested. We’re hosting a 30-day agent competition in partnership with Hugging Face

We’re calling on the community to build agents (and win money!)

arcprize.org/competitions/a…
Mikhail Samin (@mihonarium)

🚨 According to a friend, the IMO asked AI companies not to steal the spotlight from kids and to wait a week after the closing ceremony to announce results. OpenAI announced the results BEFORE the closing ceremony.

According to a Coordinator on Problem 6, the one problem OpenAI
Lewis Tunstall (@_lewtun)

An under appreciated fact about using formal methods like Lean is that it enables large-scale *collaboration* among mathematicians & potentially future AI agents.

Why? Well, you can decompose a large proof into separate components that can be proven independently with robust
Georgia Channing (@cgeorgiaw)

data of the day:

just dropped a big snapshot of polar elevation data on Hugging Face. 1000s of TIFFs and metadata to 32m resolution perfect for climate research, mapping, and geospatial modeling

check it out: huggingface.co/datasets/cgeor…

if people like this data, maybe i'll make a
Georgia Channing (@cgeorgiaw)

very proud that my work on multi-agent debate for misinformation detection won best paper award at the ICML Conference CFAgentic workshop!

check it out on arxiv: arxiv.org/abs/2410.20140

v grateful to all my co-authors and the support from BBC Research & Development 🥳
Loubna Ben Allal (@loubnabenallal1)

500k samples of multilingual post-training data in 5 languages: French, Spanish, Italian, German and Portuguese.

To address the lack of multilingual post-training datasets, we created these samples and found they improve performance on benchmarks like Global MMLU, Belebele, and
Greg Kamradt (@gregkamradt)

Anyone have a connection at Qwen?

Trying to reproduce the results on ARC Prize and getting different metrics

Want to get a hold of them and find out how they tested

tokenbender (@tokenbender)

signatures to look for in ai writing -

> "it isn't just x, it is y"
> narrative-philosophical-poetic section headings "The XYZ - A Journey of ABC"
> overuse of symbolism and lofty adjectives - "stands as a testament", "plays a vital role", "underscores its importance"
>

Wolfram Ravenwolf (@wolframrvnwlf)

I'm now using Qwen3-Coder in Claude Code. Works with any model actually, but this is surely the best one currently.

There are a bunch of proxies on GitHub that make this possible, but none worked well enough for me, so I implemented this myself using LiteLLM.

Guide in comments: