Carlo (@carlobaronio)'s Twitter Profile
Carlo

@carlobaronio

fine-tuning math & physics @stanford

ID: 1752197361072496640

Joined: 30-01-2024 05:10:08

61 Tweets

25 Followers

101 Following

Inception Labs (@inceptionailabs):

We are excited to introduce Mercury, the first commercial-grade diffusion large language model (dLLM)! dLLMs push the frontier of intelligence and speed with parallel, coarse-to-fine text generation.
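
The "parallel, coarse-to-fine" idea can be sketched in a few lines: instead of emitting tokens left to right, a masked diffusion LM starts from all masks and unmasks the most confident positions over a fixed number of steps. Below is a toy sketch with a random stand-in model; Mercury's actual sampler is not public, and toy_model is an invented placeholder.

```python
# Toy sketch of coarse-to-fine parallel decoding, in the style of a masked
# diffusion LM. Illustrative only; not Mercury's actual sampler.
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "a", "dog"]
MASK = "[MASK]"

def toy_model(tokens):
    """Stand-in for the denoiser: returns (token, confidence) per position.
    A real dLLM predicts all masked positions in one forward pass."""
    return [(random.choice(VOCAB), random.random()) if t == MASK else (t, 1.0)
            for t in tokens]

def diffusion_decode(length=8, steps=4):
    tokens = [MASK] * length
    for _ in range(steps):
        preds = toy_model(tokens)
        # Unmask the most confident positions each step: coarse first, fine later.
        k = max(1, length // steps)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:k]:
            tokens[i] = preds[i][0]
    # Fill any positions still masked after the fixed step budget.
    preds = toy_model(tokens)
    return " ".join(preds[i][0] for i in range(length))

print(diffusion_decode())
```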

Jia Li (@jiali52524397):

We believe formal math is the future.
🔥 Introducing Kimina-Prover Preview, a Numina & Kimi.ai collaboration, the first large formal reasoning model for Lean 4, achieving 80.78% on miniF2F.
github.com/MoonshotAI/Kim…
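
For context on what such provers target: a miniF2F entry is a competition-style statement in Lean whose proof must be machine-checked. A minimal toy goal in Lean 4 with Mathlib is shown below; it is illustrative only, not a problem from the Kimina-Prover release.

```lean
-- Toy miniF2F-style goal (illustrative only; not from the actual benchmark).
import Mathlib

theorem algebra_toy (x : ℝ) (h : 2 * x + 3 = 11) : x = 4 := by
  linarith
```
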
Jia Li (@jiali52524397):

Combinatorics problems were the last two left unsolved by AlphaProof at last year's IMO. Introducing CombiBench (Kimi.ai), a benchmark focusing on combinatorics problems! 🔥
🏆 moonshotai.github.io/CombiBench/
📘 Dataset -> huggingface.co/datasets/AI-MO…

Zhihong Shao (@zhs05232838):

We just released DeepSeek-Prover V2.
- Solves nearly 90% of miniF2F problems
- Significantly improves the SoTA performance on PutnamBench
- Achieves a non-trivial pass rate on AIME 24 & 25 problems in their formal version

GitHub: github.com/deepseek-ai/De…
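
"AIME problems in their formal version" means the integer answer is baked into the theorem statement and the prover must establish it in Lean. A hedged toy sketch of that shape follows; it is not an actual AIME 24/25 problem.

```lean
-- Toy sketch of an answer-based problem "in formal version": the numeric
-- answer (24) is fixed in the statement. Hypothetical example, not real AIME.
import Mathlib

theorem aime_style_toy : (2 ^ 10) % 1000 = 24 := by decide
```
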
Carlo (@carlobaronio):

Had fun training Kevin!
We explored multi-turn training to help models learn longer-horizon dynamics, and kernel generation seemed a very nice environment to try out our ideas—a step closer to coding agents! 🚀
It turns out that maybe you don't need insanely long context...
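
A multi-turn setup like the one described can be sketched as a rollout loop in which the model refines a kernel across turns using benchmark feedback. The sketch below uses invented placeholders (generate, benchmark_kernel); it is not Kevin's actual training code.

```python
# Minimal sketch of a multi-turn rollout for kernel generation.
# generate() and benchmark_kernel() are hypothetical stand-ins.
from dataclasses import dataclass, field

@dataclass
class Turn:
    kernel_src: str
    feedback: str
    reward: float

@dataclass
class Trajectory:
    prompt: str
    turns: list[Turn] = field(default_factory=list)

def generate(prompt: str, history: list[Turn]) -> str:
    """Stand-in for the policy model conditioned on prior turns' feedback."""
    return f"// kernel attempt {len(history) + 1} for: {prompt}"

def benchmark_kernel(src: str) -> tuple[float, str]:
    """Stand-in for compile-and-benchmark: returns (reward, feedback)."""
    return 0.5, "compiled; 1.2x speedup over baseline"

def rollout(prompt: str, max_turns: int = 4) -> Trajectory:
    traj = Trajectory(prompt)
    for _ in range(max_turns):
        src = generate(prompt, traj.turns)        # model refines using feedback
        reward, feedback = benchmark_kernel(src)  # environment signal per turn
        traj.turns.append(Turn(src, feedback, reward))
    return traj

traj = rollout("fuse softmax + matmul")
print(len(traj.turns), traj.turns[-1].reward)
```
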
vLLM (@vllm_project):

Great work! We love how vLLM is used in the rollout process, offloading the engine to CPU and giving the GPU back to the kernel being benchmarked! This is a small feature we implemented to make RLHF smoother with vLLM.
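
A rough sketch of the pattern described, assuming vLLM's sleep-mode API (enable_sleep_mode, llm.sleep, llm.wake_up): the engine is put to sleep to free GPU memory for benchmarking the generated kernel, then woken for the next rollout. The model name and benchmark hook are placeholders; check your vLLM version before relying on this API.

```python
# Hedged sketch: free the GPU for kernel benchmarking between rollouts.
from vllm import LLM, SamplingParams

def run_kernel_benchmark():
    """Placeholder for compiling and timing the generated kernel."""
    pass

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_sleep_mode=True)
params = SamplingParams(max_tokens=256)

outputs = llm.generate(["Write a fused softmax CUDA kernel."], params)

llm.sleep(level=1)      # offload weights to CPU, freeing GPU memory
run_kernel_benchmark()  # kernel now has the GPU to itself
llm.wake_up()           # restore the engine for the next rollout
```
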
Morph (@morph_labs):

We are excited to announce Trinity, an autoformalization system for verified superintelligence that we have developed at Morph. We have used it to automatically formalize in Lean a classical result of de Bruijn that the abc conjecture is true almost always.

MiniMax (official) (@minimax__ai):

Day 1/5 of #MiniMaxWeek: We’re open-sourcing MiniMax-M1, our latest LLM — setting new standards in long-context reasoning.

- World’s longest context window: 1M-token input, 80k-token output
- State-of-the-art agentic use among open-source models
- RL at unmatched efficiency:

Sakana AI (@sakanaailabs):

Introducing Reinforcement-Learned Teachers (RLTs): Transforming how we teach LLMs to reason with reinforcement learning (RL).
Blog: sakana.ai/rlt
Paper: arxiv.org/abs/2506.08388
Traditional RL focuses on “learning to solve” challenging problems with expensive LLMs and

Jia Li (@jiali52524397):

Happy to introduce Kimina-Prover-72B! Reaching 92.2% on miniF2F using test-time RL. It can solve IMO problems using more than 500 lines of Lean 4 code!

Check our blog post here:
huggingface.co/blog/AI-MO/kim…
And play with our demo!
demo.projectnumina.ai

Yong Lin (@yong18850571):

(1/4) 🚨 Introducing Goedel-Prover V2 🚨
🔥🔥🔥 The strongest open-source theorem prover to date.
🥇 #1 on PutnamBench: Solves 64 problems—with far less compute.
🧠 New SOTA on MiniF2F:
* 32B model hits 90.4% at Pass@32, beating DeepSeek-Prover-V2-671B’s 82.4%.
* 8B > 671B: Our 8B

Kaiyu Yang (@kaiyuyang4):

Our Goedel-Prover-V2 doubled the SOTA Pass@32 performance on PutnamBench with a 20x smaller model, making it the strongest open-source theorem prover to date!
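
For context on metrics like Pass@32: the standard unbiased pass@k estimator (Chen et al., 2021) takes n samples per problem with c successes and estimates the probability that at least one of k draws succeeds. A minimal sketch follows; whether a given prover reports this estimator or raw success over exactly k samples varies by paper.

```python
# Unbiased pass@k estimator from the Codex paper (Chen et al., 2021).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes, given
    n total samples of which c passed."""
    if n - c < k:
        return 1.0  # every k-subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 64 sampled proofs per problem, 5 verified, report pass@32.
print(round(pass_at_k(n=64, c=5, k=32), 4))
```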

Noam Brown (@polynoamial):

Typically for these AI results, like in Go/Dota/Poker/Diplomacy, researchers spend years making an AI that masters one narrow domain and does little else. But this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques.

Jerry Tworek (@millionint):

Why am I excited about the IMO results we just published:
- we did very little IMO-specific work, we just keep training general models
- all natural language proofs
- no evaluation harness
We needed a new research breakthrough and Alexander Wei and team delivered.

Alexander Wei (@alexwei_):

On IMO P6 (without going into too much detail about our setup), the model "knew" it didn't have a correct solution. The model knowing when it didn't know was one of the early signs of life that made us excited about the underlying research direction!