taesiri (@taesiri) 's Twitter Profile
PhD graduate, UofA. Working on Large Multimodal Models.

ID: 827975822095036416

Link: https://taesiri.ai/ · Joined: 04-02-2017 20:23:13

542 Tweets

715 Followers

4.4K Following

Jonathan Roberts (@jrobertsai) 's Twitter Profile Photo

📢📢 More progress on ZeroBench! With the release of Claude 4 from Anthropic, the SOTA pass@1 is now 4% 🔥
Claude Sonnet 3.7: 1%
Claude Sonnet 3.7 (Thinking): 3%
Claude Sonnet 4: 2%
Claude Sonnet 4 (Thinking): 3%
Claude Opus 4: 1%
Claude Opus 4 (Thinking): 4%

Anh Nguyen (Totti) (@anh_ng8) 's Twitter Profile Photo

Asking GPT-4o for a random choice is an *easy* way to reveal its bias 🙃 

Choose a random digit? ➡️ 7 (70% of the time❗️)
Biden vs. Trump? ➡️ Biden (100%❗️)
Male vs. Female? ➡️ Female (84%❗️)

Same story for many LLMs. 
Choice orders are randomized. 
1/6 
#icml2025
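The probing described above boils down to asking the same question many times (with randomized option order) and checking whether the answer distribution is anywhere near uniform. A minimal sketch of that frequency check; the sampled answers below are made up for illustration:

```python
from collections import Counter

def choice_frequencies(answers):
    """Given repeated model answers to the same question (with
    randomized option order), return each option's share of responses."""
    counts = Counter(answers)
    total = len(answers)
    return {option: count / total for option, count in counts.items()}

# Made-up example: 10 sampled answers to "choose a random digit".
samples = ["7", "7", "7", "7", "7", "7", "7", "3", "1", "9"]
freqs = choice_frequencies(samples)
print(freqs["7"])  # 0.7 -> far from the 0.1 expected of a uniform choice
```

Comparing each option's share against the uniform baseline (1/k for k options) is enough to flag the biases shown in the thread.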
Rohan Paul (@rohanpaul_ai) 's Twitter Profile Photo

Large language models often exhibit biases in single interactions.

Allowing LLMs to observe prior responses in multi-turn conversations helps them reduce bias, especially for random answers.

A new metric, B-score, effectively detects biases across different question types.
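The tweet doesn't give the B-score formula. One plausible sketch, assuming the metric is the gap between an answer's frequency across independent single-turn queries and its frequency in a multi-turn conversation where the model sees its prior responses (see the paper for the exact definition; the data below is made up):

```python
def b_score(answer, single_turn_answers, multi_turn_answers):
    """Hypothetical sketch of a bias gap for one answer: its frequency
    across independent single-turn queries minus its frequency in a
    multi-turn conversation with response history. A large positive
    value would suggest a bias that fades once the model sees history."""
    p_single = single_turn_answers.count(answer) / len(single_turn_answers)
    p_multi = multi_turn_answers.count(answer) / len(multi_turn_answers)
    return p_single - p_multi

# Made-up data: "7" dominates single turns but not the multi-turn run.
single = ["7"] * 7 + ["3", "1", "9"]
multi = ["7", "2", "5", "7", "0", "8", "3", "1", "9", "4"]
print(round(b_score("7", single, multi), 2))  # 0.5
```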
Anh Nguyen (Totti) (@anh_ng8) 's Twitter Profile Photo

🧵 Vision Language Models are ⚠️ biased

Q: Count the legs of this animal?
🤖: 4 ❌

Same problem:
- w/ 5 best VLMs: GPT-4.1, o3, o4-mini, Gemini 2.5 Pro, Sonnet 3.7
- on 7 domains: animals, logos, flags, chess, boardgames, optical illusions

code, paper vlmsarebiased.github.io
Anh Nguyen (Totti) (@anh_ng8) 's Twitter Profile Photo

How do the best AI image editors 🤖 (GPT-4o, Gemini 2.0, SeedEdit, HF 🤗) fare ⚔️ against human Photoshop wizards 🧙‍♀️ on text-based 🏞️ image editing? Logan Bolton and Brandon Collins shared some answers at our poster today! #CVPR2025 psrdataset.github.io A few insights 👇

Anh Nguyen (Totti) (@anh_ng8) 's Twitter Profile Photo

Pooyan Rahmanzadehgervi presenting our Transformer Attention Bottleneck paper at #CVPR2026 💡 We **simplify** MHSA (e.g. 12 heads -> 1 head) to create an attention **bottleneck** where users can debug Vision Language Models by editing the bottleneck and observing the expected VLM text outputs.

An Vo (@an_vo12) 's Twitter Profile Photo

🚨 Our latest work shows that SOTA VLMs (o3, o4-mini, Sonnet, Gemini Pro) fail at counting legs due to bias⁉️

See simple cases where VLMs get it wrong, no matter how you prompt them. 

🧪 Think your VLM can do better? Try it yourself here: vlmsarebiased.github.io/#example-galle…

1/n
#ICML2025
Cohere Labs (@cohere_labs) 's Twitter Profile Photo

Supported by one of our grants, An Vo, Mohammad Reza Taesiri, and Anh Totti Nguyen from KAIST AI, tackled bias in LLMs. Their research shows that LLMs exhibit fewer biases when they can see their previous answers, leading to the development of the B-score metric.

Anh Nguyen (Totti) (@anh_ng8) 's Twitter Profile Photo

#GPT5 STILL has a severe confirmation bias, like previous SOTA models! 😜

Try it yourself (images and prompts available in one click):
vlmsarebiased.github.io

It's fast to test for such biases in images. Similar biases should still exist in non-image domains as well...
Mathew (@mrnuu) 's Twitter Profile Photo

Anh Totti Nguyen taesiri An Vo GPT5 Pro, the most advanced version, reasoned for 1 minute 19 seconds and came up with the wrong answer. Same with GPT5 Thinking. Amazing level of incompetence from the smartest model yet.

Lucas Beyer (bl16) (@giffmana) 's Twitter Profile Photo

Oh wow, this VLM benchmark is pure evil, and I love it! "Vision Language Models are Biased" by An Vo, taesiri, Anh Totti Nguyen, et al. Also a really good idea to have one-click copy-paste of images and prompts, makes trying it super easy.

Mehran Jalali (@mehran__jalali) 's Twitter Profile Photo

Really great eval

LLMs are already great at next token prediction in the training corpus, so the test for truth-seeking is now whether they can answer correctly when the correct next token is at odds with the next token in the most similar example in the training corpus.
taesiri (@taesiri) 's Twitter Profile Photo

With all the attention on the new 🍌 Nano Banana, we (with Brandon Collins) stress-tested it with our "shirt-color chain" benchmark: 26 sequential edits where each output feeds into the next. Learn more about our study on image editing models here: psrdataset.github.io
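The chain setup itself is simple to reproduce: apply an editor repeatedly, feeding each output back in as the next input, and inspect the intermediates for drift. A minimal sketch with a hypothetical `edit` callable standing in for the real model (the toy editor below is just for illustration):

```python
def run_edit_chain(image, edit, prompts):
    """Apply `edit` sequentially; each output becomes the next input.
    Returns every intermediate result so drift can be inspected step by step."""
    outputs = []
    for prompt in prompts:
        image = edit(image, prompt)
        outputs.append(image)
    return outputs

# Toy stand-in for a real editor: just record the requested shirt color.
def toy_edit(image, prompt):
    return {**image, "shirt": prompt}

chain = run_edit_chain({"shirt": "white"}, toy_edit,
                       [f"color-{i}" for i in range(26)])
print(len(chain), chain[-1]["shirt"])  # 26 color-25
```

With a real model, any detail lost at step k (background, identity, lighting) compounds through the remaining steps, which is what makes the 26-edit chain a stress test.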

Anh Nguyen (Totti) (@anh_ng8) 's Twitter Profile Photo

Preserving details was a major challenge across SOTA AI editors (SeedEdit, GPT-4o, Gemini 2.0, and HuggingFace models) in our recent study. AIs either accidentally alter or degrade the background/identity, or they enhance aesthetics (when not requested). psrdataset.github.io

Anh Nguyen (Totti) (@anh_ng8) 's Twitter Profile Photo

yuyin zhou ICCV2025 NeurIPS Conference We have a paper in the same situation. AC: Yes! PC: No no. NeurIPS Conference, please consider whether the 1st author is a student and whether this would be their first top-tier paper BEFORE making such a cut. Healthier for junior researchers. OR use a Findings track.

taesiri (@taesiri) 's Twitter Profile Photo

Excited to share that our paper VideoGameQA-Bench 🎮 has been accepted to the #NeurIPS2025 Datasets & Benchmarks Track! 🎉

We introduce a benchmark for video game quality assurance comprising 9 tasks, including visual unit testing, visual regression, needle-in-a-haystack,
taesiri (@taesiri) 's Twitter Profile Photo

OpenAI’s Sora 2 is, without a doubt, the most impressive video generation model right now. As always, I had to test it with some of my own evals. Sora 2 still struggles to fill a container with gas or smoke without clipping through the container’s boundaries. (Veo 3 also have

taesiri (@taesiri) 's Twitter Profile Photo

A short compilation of physical glitches in Sora 2. It seems the model still struggles with everyday actions like getting into a car or using an umbrella. Some prompts lead to bizarre outputs. While Sora's physical understanding has improved dramatically, it still can't generate