Martin Vechev (@mvechev)'s Twitter Profile

Professor of Computer Science, ETH Zurich. Founder of INSAIT (insait.ai). Works on Safe/Secure AI, LLMs, Quantum. Co-founder of 6 Deep-Tech start-ups.

ID: 615004307

Link: https://www.sri.inf.ethz.ch/people/martin

Joined: 22-06-2012 09:24:46

183 Tweets

1.1K Followers

25 Following

António Costa (@eucopresident)

Inspiring visit to INSAIT Institute at Sofia Tech Park, the first institute of its kind in Eastern Europe. Its cutting-edge technology will allow countries to quickly catch up and advance on the AI front. And the upcoming BRAIN++ AI Factory, part of the EU-wide AI hub network,

INSAIT Institute (@insaitinstitute)

🇪🇺 🇧🇬 Today, António Costa visited INSAIT during his official visit to Bulgaria. The visit was also attended by the Prime Minister of Bulgaria, Rosen Zhelyazkov. Prof. Martin Vechev and Eng. Borislav Petrov presented Mr. Costa with the achievements of the institute, which

Mislav Balunović (@mbalunovic)

Two updates from MathArena:
- DeepSeek-R1-0528 shows strong performance very close to top closed source models on all competitions
- We released a research paper about our evaluation methodology and more detailed analysis of results

Nikola Jovanović @ ICLR 🇸🇬 (@ni_jovanovic)

There's a lot of work now on LLM watermarking. But can we extend this to transformers trained for autoregressive image generation? Yes, but it's not straightforward 🧵(1/10)

Jasper Dekoninck (@j_dekoninck)

Thrilled to share a major step forward for AI for mathematical proof generation! We are releasing the Open Proof Corpus: the largest ever public collection of human-annotated LLM-generated math proofs, and a large-scale study over this dataset!

INSAIT Institute (@insaitinstitute)

🤝We are delighted to announce that INSAIT is starting a joint research program with the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), one of the world’s leading and most influential research labs! 🚀All details on the joint program will be announced

INSAIT Institute (@insaitinstitute)

🌐 We are delighted to announce the launch of a new 1 million USD joint research program between INSAIT and the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), one of the top research labs in the world! 🎓 The program enables incoming INSAIT tenure-track

Mark Müller (@mnmueller)

🚨 AI agents wrote 7% of all GitHub PRs in June. But can we trust their code? We built Agents in the Wild – a live dashboard tracking autonomous AI agents across GitHub to answer that question: insights.logicstar.ai Here’s what we learned from analyzing 10M+ PRs 👇 1/n 🧵

Jasper Dekoninck (@j_dekoninck)

Grok-4 takes first place on the MathArena Leaderboard! Convincing scores across the board, with an especially impressive performance on HMMT 2025. Full results are available on matharena.ai. (1/3)

Jasper Dekoninck (@j_dekoninck)

On the SMT, a competition of 53 questions that is currently kept private, Grok-4 also convinces, but is not outperforming o4-mini and o3. (2/3)

Jasper Dekoninck (@j_dekoninck)

As models are getting close to saturating our main automated benchmarks, we are currently looking towards more challenging competitions. Some very exciting updates coming up for that in the coming days and weeks, so stay tuned! (3/3)

Mislav Balunović (@mbalunovic)

We are launching Project Euler on MathArena to track the performance of LLMs on challenging new problems at the intersection of mathematics and programming, which are published every week on the Project Euler website 🧵(1/6)

Jasper Dekoninck (@j_dekoninck)

Interesting approach! However, we looked at the proofs and methodology and we found a few problems, specifically with the use of hints given to the model. While the scaffold indeed improves performance, it does not solve all problems accurately and would not get a gold medal.🧵

Jasper Dekoninck (@j_dekoninck)

We launched a new competition on MathArena: Evaluation on the International Mathematics Competition for University students!
Our goal: Verify the gold medals on the IMO by testing agentic Gemini-2.5-Pro and Gemini IMO Deep Think
The results: The models aced the competition. 🧵

Jasper Dekoninck (@j_dekoninck)

Impressive performance of GPT OSS on MathArena, taking shared first place on the final-answer comps! **Very important** note: we ended up running the models locally, as APIs are unreliable at this time. Do not trust benchmark results run with APIs 🧵

Jasper Dekoninck (@j_dekoninck)

Results for GPT-5 on MathArena are out 🎉
The results:
Final-answer benchmarks: Slightly outperforming all other models
IMO 2025: Higher score than others, but due to the small size of the IMO, we cannot say anything about the ranking
Project Euler: Crushing the competition. 🧵

Niels Mündler (@nielstron)

1/ If you want to skip the thread, jump directly to our website (with demo, code + paper): constrained-diffusion.ai Otherwise, find a short writeup in the thread below... 👇

Nikola Jovanović @ ICLR 🇸🇬 (@ni_jovanovic)

Introducing MathArena Apex: A set of curated final-answer problems from recent competitions that even the best LLMs still can't solve. Top models are correct at most 5% of the time 🧵 (1/8)
