Mehran Kazemi (@kazemi_sm) 's Twitter Profile
Mehran Kazemi

@kazemi_sm

Staff Research Scientist @GoogleDeepMind. Research areas: Large Language Models, Reasoning, Artificial General Intelligence. Views my own.

ID: 872330935584403456

Website: https://mehran-k.github.io/
Joined: 07-06-2017 05:54:36

320 Tweets

1.1K Followers

534 Following

Tanishq Mathew Abraham, Ph.D. (@iscienceluvr) 's Twitter Profile Photo

BIG-Bench Extra Hard

Google DeepMind introduces BIG-Bench Extra Hard (BBEH), a new benchmark designed to push the boundaries of LLM reasoning evaluation. 

"BBEH replaces each task in BBH with a novel task that probes a similar reasoning capability but exhibits significantly
Arian Hosseini (@ariantbd) 's Twitter Profile Photo

Happy to see a flavor 🧂 of Compositional GSM in this work.
Comp GSM: arxiv.org/abs/2410.01748
BBEH: arxiv.org/abs/2502.19187

Mehran Kazemi (@kazemi_sm) 's Twitter Profile Photo

“It’s been only <N> days since <Company> released <Model> and people can’t stop building with it” is sooooo cliche! Almost half of my timeline every day! Are AI influencers all using the same LLM for writing their tweets?

Eduardo Sánchez (@eduardosg_ai) 's Twitter Profile Photo

Happy to see that Linguini, our benchmark for language-agnostic linguistic reasoning, has been included in DeepMind’s BIG-Bench Extra Hard (BBEH).

Linguini remains challenging for reasoning models, being one of only two (hard) tasks where o3-mini doesn't show massive gains.
Reza Bayat (@reza_byt) 's Twitter Profile Photo

New Paper Alert!📄

"It’s better to be sparse than to be dense" ✨

We explore how to steer LLMs (like Gemma-2 2B & 9B) by modifying their activations in sparse spaces, enabling more precise, interpretable control & improved monosemanticity with scaling.

Let’s break it down! 🧵
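The sparse-steering idea in this thread can be illustrated with a minimal sketch (not code from the paper; the dimensions, weights, and the `steer` helper below are hypothetical placeholders): project a residual-stream activation through an SAE-style encoder, nudge one named feature, and decode back to the dense space.

```python
# Minimal sketch of steering in a sparse feature space (illustrative only, not the paper's code).
# Toy SAE-style encoder/decoder with random weights; in practice the SAE would be trained on the
# LLM's residual-stream activations and `feature_idx` chosen as an interpretable feature.
import torch

d_model, d_sparse = 256, 2048                   # hypothetical dimensions
W_enc = torch.randn(d_model, d_sparse) / d_model ** 0.5
W_dec = torch.randn(d_sparse, d_model) / d_sparse ** 0.5
b_enc = torch.zeros(d_sparse)

def steer(activation: torch.Tensor, feature_idx: int, delta: float) -> torch.Tensor:
    """Encode to sparse codes, bump a single feature, decode back to dense space."""
    z = torch.relu(activation @ W_enc + b_enc)  # sparse, non-negative feature codes
    z[..., feature_idx] += delta                # edit one named feature, not a dense direction
    return z @ W_dec                            # steered dense activation

h = torch.randn(1, d_model)                     # stand-in for a residual-stream activation
h_steered = steer(h, feature_idx=123, delta=4.0)
print(h_steered.shape)                          # torch.Size([1, 256])
```

Editing a single named feature rather than adding a dense steering vector is the kind of intervention the tweet's claim of more precise, interpretable control points to; the sketch only shows the shape of that operation, not the trained components.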
Mehran Kazemi (@kazemi_sm) 's Twitter Profile Photo

The evaluation code is now available at: github.com/google-deepmin…
Also consider submitting to our leaderboard: github.com/google-deepmin…

Mehran Kazemi (@kazemi_sm) 's Twitter Profile Photo

We’ve had several papers rejected primarily because of this rule and because many researchers don’t know about it. A NeurIPS AC even accused us of lying, and an ACL AC literally said “it’s your problem”. Hope Sam Altman’s tweet raises awareness of this rule.

Reyhane Askari (@reyhaneaskari) 's Twitter Profile Photo

Excited to be at #ICLR2025 next week! I'm currently on the job market for Research Scientist positions, especially in generative modeling, synthetic data, diffusion models, or responsible AI. Feel free to reach out if you have any openings!

Hritik Bansal (@hbxnov) 's Twitter Profile Photo

✈️ I will be at ICLR 2025 🇸🇬 to present the following work on LLM reasoning, vision-language understanding, and LLM evaluation w/ uclanlp, UCLA Machine Intelligence (MINT), and Google DeepMind!

Come to the poster sessions and say hi 👋 I will be happy to meet folks from
Mohammad Pezeshki (@mpezeshki91) 's Twitter Profile Photo

I'm presenting our recent work on "Pitfalls of Memorization" today at ICLR
#304 at 3pm.
Come say hi!
iclr.cc/virtual/2025/p…
Arian Hosseini (@ariantbd) 's Twitter Profile Photo

New Paper! 📣 RL^V: a unified RL & generative verifier — boosts MATH accuracy by 20% and improves both sequential and parallel test-time scaling
☑️ improves out-of-domain and easy-to-hard generalization
☑️ allows dynamic allocation of compute for harder problems
How? 👇🏻

Mehran Kazemi (@kazemi_sm) 's Twitter Profile Photo

Following several requests, we now have BBEH Mini, with 460 examples (20 per task), for faster and cheaper experimentation.
The set can be downloaded from: github.com/google-deepmin…
The results are reported in Table 3 of arxiv.org/pdf/2502.19187

vinh q. tran (@vqctran) 's Twitter Profile Photo

crazy to see a wild research idea I dreamt about for a long time land in Gemini 2.5 Flash! it's been amazing making this happen with Yi Tay and Quoc Le, and the rest of the team!!!

Hritik Bansal (@hbxnov) 's Twitter Profile Photo

Great to see that the latest #GeminiDiffusion release benchmarks on our challenging general-purpose reasoning dataset, BIG-Bench Extra Hard!

It is now available on HF 🤗: huggingface.co/datasets/BBEH/…
Eval code: github.com/google-deepmin…
Bahare Fatemi (@baharefatemi) 's Twitter Profile Photo

I'm excited to be speaking at the 1st Workshop on Large Language Models for Cross-Temporal Research at COLM 2025 in Montreal! This workshop is tackling crucial issues around LLMs' understanding of time. Don't miss out! More details & submission deadlines: lnkd.in/eaaAR_z2