SRI Lab (@the_sri_lab)'s Twitter Profile
SRI Lab

@the_sri_lab

ID: 1051516728553996288

Link: https://www.sri.inf.ethz.ch/
Joined: 14-10-2018 16:54:59

189 Tweets

714 Followers

166 Following

Mislav Balunović (@mbalunovic)

MathArena results for HMMT Feb 2025 are out, showing that high school math competitions are still far from being solved by frontier LLMs, with only o3-mini crossing the 50% mark!

Mislav Balunović (@mbalunovic)

How good are LLMs at producing constructive proofs? In our latest paper we introduce MathConstruct, a benchmark consisting of challenging olympiad-level problems where the solution requires a proof by construction.

Mislav Balunović (@mbalunovic)

Can LLMs actually solve hard math problems? Given the strong performance at AIME, we now go to the next tier: our MathArena team has conducted a detailed evaluation using the recent 2025 USA Math Olympiad. The results are… bad: all models scored less than 5%!

Mislav Balunović (@mbalunovic)

Big update to our MathArena USAMO evaluation: Gemini 2.5 Pro, which was released *the same day* as our benchmark, is the first model to achieve a non-trivial number of points (24.4%). The speed of progress is really mind-blowing.

SRI Lab (@the_sri_lab)

Check out this recent work from our lab showing that benign-looking LLMs can hide backdoors that activate upon finetuning!