Martin Vechev (@mvechev) Twitter Tweets • TwiCopy

António Costa

7 months ago

Inspiring visit to INSAIT Institute at Sofia Tech Park, the first institute of its kind in Eastern Europe. Its cutting-edge technology will allow countries to quickly catch up and advance on the AI front. And the upcoming BRAIN++ AI Factory, part of the EU-wide AI hub network,

INSAIT Institute

@insaitinstitute

7 months ago

🇪🇺 🇧🇬 Today, António Costa visited INSAIT during his official visit to Bulgaria. The visit was also attended by Prime Minister of Bulgaria Rosen Zhelyazkov. Prof. Martin Vechev and Eng. Borislav Petrov presented Mr. Costa with the achievements of the institute, which

Mislav Balunović

@mbalunovic

6 months ago

Two updates from MathArena:
- DeepSeek-R1-0528 shows strong performance, very close to top closed-source models on all competitions
- We released a research paper about our evaluation methodology and a more detailed analysis of results

Nikola Jovanović @ ICLR 🇸🇬

@ni_jovanovic

5 months ago

There's a lot of work now on LLM watermarking. But can we extend this to transformers trained for autoregressive image generation? Yes, but it's not straightforward 🧵(1/10)

Jasper Dekoninck

@j_dekoninck

5 months ago

Thrilled to share a major step forward for AI for mathematical proof generation! We are releasing the Open Proof Corpus: the largest ever public collection of human-annotated LLM-generated math proofs, and a large-scale study over this dataset!

INSAIT Institute

@insaitinstitute

5 months ago

🤝 We are delighted to announce that INSAIT is starting a joint research program with the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), one of the world’s leading and most influential research labs! 🚀 All details on the joint program will be announced

INSAIT Institute

@insaitinstitute

5 months ago

🌐 We are delighted to announce the launch of a new 1 million USD joint research program between INSAIT and the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), one of the top research labs in the world! 🎓 The program enables incoming INSAIT tenure-track

Mark Müller

@mnmueller

4 months ago

🚨 AI agents wrote 7% of all GitHub PRs in June. But can we trust their code? We built Agents in the Wild – a live dashboard tracking autonomous AI agents across GitHub to answer that question: insights.logicstar.ai Here’s what we learned from analyzing 10M+ PRs 👇 1/n 🧵

Jasper Dekoninck

@j_dekoninck

4 months ago

Grok-4 takes first place on the MathArena Leaderboard! Convincing scores across the board, with an especially impressive performance on HMMT 2025. Full results are available on matharena.ai. (1/3)

Jasper Dekoninck

@j_dekoninck

4 months ago

On the SMT, a competition of 53 questions that is currently kept private, Grok-4 also convinces, but is not outperforming o4-mini and o3. (2/3)

Jasper Dekoninck

@j_dekoninck

4 months ago

As models are getting close to saturating our main automated benchmarks, we are currently looking towards more challenging competitions. Some very exciting updates coming up for that in the coming days and weeks, so stay tuned! (3/3)

Mislav Balunović

@mbalunovic

4 months ago

We are launching Project Euler on MathArena to track the performance of LLMs on challenging new problems at the intersection of mathematics and programming, which are published every week on the Project Euler website 🧵(1/6)

Jasper Dekoninck

@j_dekoninck

4 months ago

Interesting approach! However, we looked at the proofs and methodology and we found a few problems, specifically with the use of hints given to the model. While the scaffold indeed improves performance, it does not solve all problems accurately and would not get a gold medal.🧵

Jasper Dekoninck

@j_dekoninck

4 months ago

We launched a new competition on MathArena: Evaluation on the International Mathematics Competition for University Students!
Our goal: Verify the gold medals on the IMO by testing agentic Gemini-2.5-Pro and Gemini IMO Deep Think.
The results: The models aced the competition. 🧵

Jasper Dekoninck

@j_dekoninck

3 months ago

Impressive performance of GPT OSS on MathArena, taking shared first place on the final-answer comps! **Very important** note: we ended up running the models locally, as APIs are unreliable at this time. Do not trust benchmark results run with APIs 🧵

Jasper Dekoninck

@j_dekoninck

3 months ago

Results for GPT-5 on MathArena are out 🎉
The results:
- Final-answer benchmarks: Slightly outperforming all other models
- IMO 2025: Higher score than others, but due to the small size of the IMO, we cannot say anything about the ranking
- Project Euler: Crushing the competition. 🧵

Niels Mündler

@nielstron

3 months ago

1/ If you want to skip the thread, jump directly to our website (with demo, code + paper): constrained-diffusion.ai Otherwise, find a short writeup in the thread below... 👇

Nikola Jovanović @ ICLR 🇸🇬

@ni_jovanovic

3 months ago

Introducing MathArena Apex: A set of curated final-answer problems from recent competitions that even the best LLMs still can't solve. Top models are correct at most 5% of the time 🧵 (1/8)
