Mehran Kazemi (@kazemi_sm) 's Twitter Profile
Mehran Kazemi

@kazemi_sm

Staff Research Scientist @GoogleDeepMind. Research areas: Large Language Models, Reasoning, Artificial General Intelligence. Views my own.

ID: 872330935584403456

Website: https://mehran-k.github.io/
Joined: 07-06-2017 05:54:36

320 Tweets

1.1K Followers

534 Following

Tanishq Mathew Abraham, Ph.D. (@iscienceluvr) 's Twitter Profile Photo

BIG-Bench Extra Hard

Google DeepMind introduces BIG-Bench Extra Hard (BBEH), a new benchmark designed to push the boundaries of LLM reasoning evaluation. 

"BBEH replaces each task in BBH with a novel task that probes a similar reasoning capability but exhibits significantly
Arian Hosseini (@ariantbd) 's Twitter Profile Photo

Happy to see a flavor 🧂 of Compositional GSM in this work.
Comp GSM: arxiv.org/abs/2410.01748
BBEH: arxiv.org/abs/2502.19187

Mehran Kazemi (@kazemi_sm) 's Twitter Profile Photo

“It’s been only <N> days since <Company> released <Model> and people can’t stop building with it” is sooooo cliche! Almost half of my timeline every day! Are AI influencers all using the same LLM for writing their tweets?

Eduardo Sánchez (@eduardosg_ai) 's Twitter Profile Photo

Happy to see that Linguini, our benchmark for language-agnostic linguistic reasoning, has been included in DeepMind’s BIG-Bench Extra Hard (BBEH).

Linguini remains challenging for reasoning models, being one of only two (hard) tasks where o3-mini doesn't show massive gains.
Reza Bayat (@reza_byt) 's Twitter Profile Photo

New Paper Alert!📄

"It’s better to be sparse than to be dense" ✨

We explore how to steer LLMs (like Gemma-2 2B & 9B) by modifying their activations in sparse spaces, enabling more precise, interpretable control & improved monosemanticity with scaling.

Let’s break it down! 🧵
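The sparse-steering idea in this thread can be illustrated with a minimal sketch (not code from the paper; the dimensions, weights, and the `steer` helper below are hypothetical placeholders): project a residual-stream activation through an SAE-style encoder, nudge one named feature, and decode back to the dense space.

```python
# Minimal sketch of steering in a sparse feature space (illustrative only, not the paper's code).
# Toy SAE-style encoder/decoder with random weights; in practice the SAE would be trained on the
# LLM's residual-stream activations and `feature_idx` chosen as an interpretable feature.
import torch

d_model, d_sparse = 256, 2048                   # hypothetical dimensions
W_enc = torch.randn(d_model, d_sparse) / d_model ** 0.5
W_dec = torch.randn(d_sparse, d_model) / d_sparse ** 0.5
b_enc = torch.zeros(d_sparse)

def steer(activation: torch.Tensor, feature_idx: int, delta: float) -> torch.Tensor:
    """Encode to sparse codes, bump a single feature, decode back to dense space."""
    z = torch.relu(activation @ W_enc + b_enc)  # sparse, non-negative feature codes
    z[..., feature_idx] += delta                # edit one named feature, not a dense direction
    return z @ W_dec                            # steered dense activation

h = torch.randn(1, d_model)                     # stand-in for a residual-stream activation
h_steered = steer(h, feature_idx=123, delta=4.0)
print(h_steered.shape)                          # torch.Size([1, 256])
```

Editing a single named feature rather than adding a dense steering vector is the kind of intervention the tweet's claim of more precise, interpretable control points to; the sketch only shows the shape of that operation, not the trained components.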
Mehran Kazemi (@kazemi_sm) 's Twitter Profile Photo

The evaluation code is now available at: github.com/google-deepmin…
Also consider submitting to our leaderboard: github.com/google-deepmin…

Mehran Kazemi (@kazemi_sm) 's Twitter Profile Photo

We’ve had several papers rejected primarily because of this rule and because many researchers don’t know about it. A NeurIPS AC even accused us of lying, and an ACL AC literally said “it’s your problem”. Hope Sam Altman’s tweet raises awareness of this rule.

Reyhane Askari (@reyhaneaskari) 's Twitter Profile Photo

Excited to be at #ICLR2025 next week! I'm currently on the job market for Research Scientist positions, especially in generative modeling, synthetic data, diffusion models, or responsible AI. Feel free to reach out if you have any openings!

Hritik Bansal (@hbxnov) 's Twitter Profile Photo

✈️ I will be at ICLR 2025 🇸🇬 to present the following work on LLM reasoning, vision-language understanding, and LLM evaluation w/ uclanlp, UCLA Machine Intelligence (MINT), and Google DeepMind!

Come to the poster sessions and say hi 👋 I will be happy to meet folks from
Mohammad Pezeshki (@mpezeshki91) 's Twitter Profile Photo

I'm presenting our recent work on "Pitfalls of Memorization" today at ICLR
#304 at 3pm.
Come say hi!
iclr.cc/virtual/2025/p…
Arian Hosseini (@ariantbd) 's Twitter Profile Photo

New Paper! 📣 RL^V: a unified RL & generative verifier — boosts MATH accuracy by 20% and improves both sequential and parallel test-time scaling
☑️ improves out-of-domain and easy-to-hard generalization
☑️ allows dynamic allocation of compute for harder problems
How? 👇🏻

Mehran Kazemi (@kazemi_sm) 's Twitter Profile Photo

Following several requests, we now have BBEH Mini, with 460 examples (20 per task), for faster and cheaper experimentation.
The set can be downloaded from: github.com/google-deepmin…
The results are reported in Table 3 of arxiv.org/pdf/2502.19187

vinh q. tran (@vqctran) 's Twitter Profile Photo

crazy to see a wild research idea I dreamt about for a long time land in Gemini 2.5 Flash! it's been amazing making this happen with Yi Tay and Quoc Le, and the rest of the team!!!

Hritik Bansal (@hbxnov) 's Twitter Profile Photo

Great to see that the latest #GeminiDiffusion release benchmarks on our challenging general-purpose reasoning dataset, BIG-Bench Extra Hard!

It is now available on HF 🤗: huggingface.co/datasets/BBEH/…
Eval code: github.com/google-deepmin…
Bahare Fatemi (@baharefatemi) 's Twitter Profile Photo

I'm excited to be speaking at the 1st Workshop on Large Language Models for Cross-Temporal Research at COLM 2025 in Montreal! This workshop is tackling crucial issues around LLMs' understanding of time. Don't miss out! More details & submission deadlines: lnkd.in/eaaAR_z2