taesiri (@taesiri) 's Twitter Profile
PhD graduate, UofA. Working on Large Multimodal Models.

ID: 827975822095036416

Link: https://taesiri.ai/ · Joined: 04-02-2017 20:23:13

542 Tweets

715 Followers

4.4K Following

Jonathan Roberts (@jrobertsai) 's Twitter Profile Photo

📢📢 More progress on ZeroBench! With the release of Claude 4 from Anthropic, the SOTA pass@1 is now 4% 🔥
Claude Sonnet 3.7: 1%
Claude Sonnet 3.7 (Thinking): 3%
Claude Sonnet 4: 2%
Claude Sonnet 4 (Thinking): 3%
Claude Opus 4: 1%
Claude Opus 4 (Thinking): 4%

Anh Nguyen (Totti) (@anh_ng8) 's Twitter Profile Photo

Asking GPT-4o for a random choice is an *easy* way to reveal its bias 🙃 

Choose a random digit? ➡️ 7 (70% of the time❗️)
Biden vs. Trump? ➡️ Biden (100%❗️)
Male vs. Female? ➡️ Female (84%❗️)

Same story for many LLMs. 
Choice orders are randomized. 
1/6 
#icml2025
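The probing described above boils down to asking the same question many times (with randomized option order) and checking whether the answer distribution is anywhere near uniform. A minimal sketch of that frequency check; the sampled answers below are made up for illustration:

```python
from collections import Counter

def choice_frequencies(answers):
    """Given repeated model answers to the same question (with
    randomized option order), return each option's share of responses."""
    counts = Counter(answers)
    total = len(answers)
    return {option: count / total for option, count in counts.items()}

# Made-up example: 10 sampled answers to "choose a random digit".
samples = ["7", "7", "7", "7", "7", "7", "7", "3", "1", "9"]
freqs = choice_frequencies(samples)
print(freqs["7"])  # 0.7 -> far from the 0.1 expected of a uniform choice
```

Comparing each option's share against the uniform baseline (1/k for k options) is enough to flag the biases shown in the thread.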
Rohan Paul (@rohanpaul_ai) 's Twitter Profile Photo

Large language models often exhibit biases in single interactions.

Allowing LLMs to observe prior responses in multi-turn conversations helps them reduce bias, especially for random answers.

A new metric, B-score, effectively detects biases across different question types.
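The tweet doesn't give the B-score formula. One plausible sketch, assuming the metric is the gap between an answer's frequency across independent single-turn queries and its frequency in a multi-turn conversation where the model sees its prior responses (see the paper for the exact definition; the data below is made up):

```python
def b_score(answer, single_turn_answers, multi_turn_answers):
    """Hypothetical sketch of a bias gap for one answer: its frequency
    across independent single-turn queries minus its frequency in a
    multi-turn conversation with response history. A large positive
    value would suggest a bias that fades once the model sees history."""
    p_single = single_turn_answers.count(answer) / len(single_turn_answers)
    p_multi = multi_turn_answers.count(answer) / len(multi_turn_answers)
    return p_single - p_multi

# Made-up data: "7" dominates single turns but not the multi-turn run.
single = ["7"] * 7 + ["3", "1", "9"]
multi = ["7", "2", "5", "7", "0", "8", "3", "1", "9", "4"]
print(round(b_score("7", single, multi), 2))  # 0.5
```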
Anh Nguyen (Totti) (@anh_ng8) 's Twitter Profile Photo

🧵 Vision Language Models are ⚠️ biased

Q: Count the legs of this animal?
🤖: 4 ❌

Same problem:
- w/ 5 best VLMs: GPT-4.1, o3, o4-mini, Gemini 2.5 Pro, Sonnet 3.7
- on 7 domains: animals, logos, flags, chess, boardgames, optical illusions

code, paper vlmsarebiased.github.io
Anh Nguyen (Totti) (@anh_ng8) 's Twitter Profile Photo

How do the best AI image editors 🤖 (GPT-4o, Gemini 2.0, SeedEdit, HF 🤗) fare ⚔️ against human Photoshop wizards 🧙‍♀️ on text-based 🏞️ image editing? Logan Bolton and Brandon Collins shared some answers at our poster today! #CVPR2025 psrdataset.github.io A few insights 👇

Anh Nguyen (Totti) (@anh_ng8) 's Twitter Profile Photo

Pooyan Rahmanzadehgervi presenting our Transformer Attention Bottleneck paper at #CVPR2026 💡 We **simplify** MHSA (e.g. 12 heads -> 1 head) to create an attention **bottleneck** where users can debug Vision Language Models by editing the bottleneck and observing the expected VLM text outputs.

An Vo (@an_vo12) 's Twitter Profile Photo

🚨 Our latest work shows that SOTA VLMs (o3, o4-mini, Sonnet, Gemini Pro) fail at counting legs due to bias⁉️

See simple cases where VLMs get it wrong, no matter how you prompt them. 

🧪 Think your VLM can do better? Try it yourself here: vlmsarebiased.github.io/#example-galle…

1/n
#ICML2025
Cohere Labs (@cohere_labs) 's Twitter Profile Photo

Supported by one of our grants, An Vo, Mohammad Reza Taesiri, and Anh Totti Nguyen from KAIST AI, tackled bias in LLMs. Their research shows that LLMs exhibit fewer biases when they can see their previous answers, leading to the development of the B-score metric.

Anh Nguyen (Totti) (@anh_ng8) 's Twitter Profile Photo

#GPT5 STILL has a severe confirmation bias, like previous SOTA models! 😜

Try it yourself (images and prompts available in one click):
vlmsarebiased.github.io

It's fast to test for such biases in images. Similar biases should still exist in non-image domains as well...
Mathew (@mrnuu) 's Twitter Profile Photo

Anh Totti Nguyen taesiri An Vo GPT5 Pro, the most advanced version, reasoned for 1 minute 19 seconds and came up with the wrong answer. Same with GPT5 Thinking. Amazing level of incompetence from the smartest model yet.

Lucas Beyer (bl16) (@giffmana) 's Twitter Profile Photo

Oh wow, this VLM benchmark is pure evil, and I love it! "Vision Language Models are Biased" by An Vo, taesiri, Anh Totti Nguyen, et al. Also a really good idea to have one-click copy-paste of images and prompts, makes trying it super easy.

Mehran Jalali (@mehran__jalali) 's Twitter Profile Photo

Really great eval

LLMs are already great at next token prediction in the training corpus, so the test for truth-seeking is now whether they can answer correctly when the correct next token is at odds with the next token in the most similar example in the training corpus.
taesiri (@taesiri) 's Twitter Profile Photo

With all the attention on the new 🍌 Nano Banana, we (with Brandon Collins) stress-tested it with our "shirt-color chain" benchmark: 26 sequential edits where each output feeds into the next. Learn more about our study on image editing models here: psrdataset.github.io
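The chain setup itself is simple to reproduce: apply an editor repeatedly, feeding each output back in as the next input, and inspect the intermediates for drift. A minimal sketch with a hypothetical `edit` callable standing in for the real model (the toy editor below is just for illustration):

```python
def run_edit_chain(image, edit, prompts):
    """Apply `edit` sequentially; each output becomes the next input.
    Returns every intermediate result so drift can be inspected step by step."""
    outputs = []
    for prompt in prompts:
        image = edit(image, prompt)
        outputs.append(image)
    return outputs

# Toy stand-in for a real editor: just record the requested shirt color.
def toy_edit(image, prompt):
    return {**image, "shirt": prompt}

chain = run_edit_chain({"shirt": "white"}, toy_edit,
                       [f"color-{i}" for i in range(26)])
print(len(chain), chain[-1]["shirt"])  # 26 color-25
```

With a real model, any detail lost at step k (background, identity, lighting) compounds through the remaining steps, which is what makes the 26-edit chain a stress test.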

Anh Nguyen (Totti) (@anh_ng8) 's Twitter Profile Photo

Preserving details was a major challenge across SOTA AI editors (SeedEdit, GPT-4o, Gemini 2.0, and HuggingFace models) in our recent study. AIs either accidentally alter or degrade the background/identity, or they enhance aesthetics (when not requested). psrdataset.github.io

Anh Nguyen (Totti) (@anh_ng8) 's Twitter Profile Photo

yuyin zhou ICCV2025 NeurIPS Conference We have a paper in the same situation. AC: Yes! PC: No no. NeurIPS Conference, please consider whether the 1st author is a student and whether this would be their first top-tier paper BEFORE making such a cut. Healthier for junior researchers. OR use a Findings track.

taesiri (@taesiri) 's Twitter Profile Photo

Excited to share that our paper VideoGameQA-Bench 🎮 has been accepted to the #NeurIPS2025 Datasets & Benchmarks Track! 🎉

We introduce a benchmark for video game quality assurance comprising 9 tasks, including visual unit testing, visual regression, needle-in-a-haystack,
taesiri (@taesiri) 's Twitter Profile Photo

OpenAI’s Sora 2 is, without a doubt, the most impressive video generation model right now. As always, I had to test it with some of my own evals. Sora 2 still struggles to fill a container with gas or smoke without clipping through the container’s boundaries. (Veo 3 also have

taesiri (@taesiri) 's Twitter Profile Photo

A short compilation of physical glitches in Sora 2. It seems the model still struggles with everyday actions like getting into a car or using an umbrella. Some prompts lead to bizarre outputs. While Sora's physical understanding has improved dramatically, it still can't generate