Xuehai He (@xuehaih)'s Twitter Profile
Xuehai He

@xuehaih

Vision and language

ID: 1318704802973376514

Link: http://sheehan1230.github.io
Joined: 21-10-2020 00:05:04

92 Tweets

182 Followers

304 Following

Kaiwen Zhou (@kaiwenzhou9)'s Twitter Profile Photo

๐Ÿ›ก๏ธ R1 Safety Paper Alert! ๐Ÿ“ฐ How safe are large reasoning models like R1? What is their safety behavior? Does their enhanced capability introduce greater risks? โ€” We present a comprehensive safety analysis on large reasoning models: ๐Ÿ”ฅ Key Findings: 1๏ธโƒฃOpen-source R1 models lag

๐Ÿ›ก๏ธ R1 Safety Paper Alert! ๐Ÿ“ฐ

How safe are large reasoning models like R1? What is their safety behavior? Does their enhanced capability introduce greater risks? โ€” We present a comprehensive safety analysis on large reasoning models:

๐Ÿ”ฅ Key Findings:
1๏ธโƒฃOpen-source R1 models lag
Qianqi "Jackie" Yan (@qianqi_yan)'s Twitter Profile Photo

New Paper Alert: Multimodal Inconsistency Reasoning (MMIR)! ✨

Ever visited a webpage where the text says "IKEA desk" yet images and descriptions elsewhere show a totally different brand? Or read a slide that shows "50% growth" in the text but the accompanying chart looks flat?
Xin Eric Wang @ ICLR 2025 (@xwang_lk)'s Twitter Profile Photo

๐๐ž๐š๐ญ๐ข๐ง๐  ๐Ž๐ฉ๐ž๐ง๐€๐ˆ ๐ข๐ฌ ๐ง๐จ๐ญ ๐š๐ฌ ๐ก๐š๐ซ๐ ๐š๐ฌ ๐ฒ๐จ๐ฎ ๐ญ๐ก๐ข๐ง๐ค. If you don't believe you can compete, you've already lost. Winning starts with mindset. ๐Ÿš€Introducing ๐‘จ๐’ˆ๐’†๐’๐’• ๐‘บ2, ๐ญ๐ก๐ž ๐ฐ๐จ๐ซ๐ฅ๐'๐ฌ ๐›๐ž๐ฌ๐ญ ๐œ๐จ๐ฆ๐ฉ๐ฎ๐ญ๐ž๐ซ-๐ฎ๐ฌ๐ž ๐š๐ ๐ž๐ง๐ญ, and the second

๐๐ž๐š๐ญ๐ข๐ง๐  ๐Ž๐ฉ๐ž๐ง๐€๐ˆ ๐ข๐ฌ ๐ง๐จ๐ญ ๐š๐ฌ ๐ก๐š๐ซ๐ ๐š๐ฌ ๐ฒ๐จ๐ฎ ๐ญ๐ก๐ข๐ง๐ค. If you don't believe you can compete, you've already lost. Winning starts with mindset.

๐Ÿš€Introducing ๐‘จ๐’ˆ๐’†๐’๐’• ๐‘บ2, ๐ญ๐ก๐ž ๐ฐ๐จ๐ซ๐ฅ๐'๐ฌ ๐›๐ž๐ฌ๐ญ ๐œ๐จ๐ฆ๐ฉ๐ฎ๐ญ๐ž๐ซ-๐ฎ๐ฌ๐ž ๐š๐ ๐ž๐ง๐ญ, and the second
Xin Eric Wang @ ICLR 2025 (@xwang_lk)'s Twitter Profile Photo

Since launching Agent S2, many folks working on GUI/computer-use agents asked for our tech report. Here we go! 🎉 New SOTA on 3 major computer-use benchmarks.

• OSWorld (15 steps): 27.0% 🚀 (+18.9%)
• OSWorld (50 steps): 34.5% 🚀 (+32.7%)
• WindowsAgentArena: 29.8% 🚀
Xuehai He (@xuehaih)'s Twitter Profile Photo

Hello, every ICLR participant! I will be presenting our MMWorld on 4/24, 9:30 am - 11:30 am (SST) ✈️✈️. Come by our poster to chat and discuss the next-generation video understanding model! 😎😎 Looking forward to meeting old and new friends there! #ICLR2025
Simular (@simularai)'s Twitter Profile Photo

Meet ๐—ฆ๐—ถ๐—บ๐˜‚๐—น๐—ฎ๐—ฟ โ€” the first AI agent that browses the Internet "with" you, right on your Mac. It acts locally on macOS and works alongside you. Take over anytime or team up with Simular in real time. See how Simular makes everyday digital life faster, easier, and smarter.

Meet ๐—ฆ๐—ถ๐—บ๐˜‚๐—น๐—ฎ๐—ฟ โ€” the first AI agent that browses the Internet "with" you, right on your Mac.

It acts locally on macOS and works alongside you.

Take over anytime or team up with Simular in real time.

See how Simular makes everyday digital life faster, easier, and smarter.
Jianwei Yang (@jw2yang4ai)'s Twitter Profile Photo

🚀 Excited to announce our 4th Workshop on Computer Vision in the Wild (CVinW) at #CVPR2025!
🔗 computer-vision-in-the-wild.github.io/cvpr-2025/

⭐ We have invited a great lineup of speakers: Prof. Kaiming He, Prof. Boqing Gong, Prof. Cordelia Schmid, Prof. Ranjay Krishna, Prof. Saining Xie, Prof.
Yiping Wang (@ypwang61)'s Twitter Profile Photo

We only need ONE example for RLVR on LLMs to achieve significant improvement on math tasks!

📝 RLVR with one training example can boost:
- Qwen2.5-Math-1.5B: 36.0% → 73.6%
- Qwen2.5-Math-7B: 51.0% → 79.2%
on MATH500.

📄 Paper: arxiv.org/abs/2504.20571
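The tweet reports results only; as a rough, hypothetical sketch of what "RLVR" (reinforcement learning with verifiable rewards) means in this one-example setting, the reward can be a simple programmatic check of the final answer rather than a learned reward model. The function names and the toy example below are illustrative and not taken from the paper.

```python
import re

def extract_boxed_answer(completion: str):
    """Pull the last \\boxed{...} answer out of a model completion, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary 'verifiable' reward: 1.0 if the final answer matches, else 0.0."""
    answer = extract_boxed_answer(completion)
    return 1.0 if answer == ground_truth else 0.0

# Hypothetical single training example: in 1-shot RLVR, every policy update
# is driven by rollouts of one prompt, each scored by a verifier like this.
example = {"prompt": "Compute 3 + 4. Put the final answer in \\boxed{}.",
           "answer": "7"}
rollout = "3 + 4 = 7, so the final answer is \\boxed{7}."
print(verifiable_reward(rollout, example["answer"]))  # 1.0
```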
Saaket Agashe @ NAACL 2025 (@saa1605)'s Twitter Profile Photo

I'll be presenting our poster:
"LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models"
tomorrow at #NAACL2025!

⏰ 11 AM
📍 Hall 3

Drop by to chat about applications of LLMs for Multi-Agent Coordination! #MultiAgentAI #LLMs
Xin Eric Wang @ ICLR 2025 (@xwang_lk)'s Twitter Profile Photo

๐˜๐˜ถ๐˜ฎ๐˜ข๐˜ฏ๐˜ด ๐˜ต๐˜ฉ๐˜ช๐˜ฏ๐˜ฌ ๐˜ง๐˜ญ๐˜ถ๐˜ช๐˜ฅ๐˜ญ๐˜บโ€”๐˜ฏ๐˜ข๐˜ท๐˜ช๐˜จ๐˜ข๐˜ต๐˜ช๐˜ฏ๐˜จ ๐˜ข๐˜ฃ๐˜ด๐˜ต๐˜ณ๐˜ข๐˜ค๐˜ต ๐˜ค๐˜ฐ๐˜ฏ๐˜ค๐˜ฆ๐˜ฑ๐˜ต๐˜ด ๐˜ฆ๐˜ง๐˜ง๐˜ฐ๐˜ณ๐˜ต๐˜ญ๐˜ฆ๐˜ด๐˜ด๐˜ญ๐˜บ, ๐˜ง๐˜ณ๐˜ฆ๐˜ฆ ๐˜ง๐˜ณ๐˜ฐ๐˜ฎ ๐˜ณ๐˜ช๐˜จ๐˜ช๐˜ฅ ๐˜ญ๐˜ช๐˜ฏ๐˜จ๐˜ถ๐˜ช๐˜ด๐˜ต๐˜ช๐˜ค ๐˜ฃ๐˜ฐ๐˜ถ๐˜ฏ๐˜ฅ๐˜ข๐˜ณ๐˜ช๐˜ฆ๐˜ด. But current reasoning models remain constrained by discrete tokens, limiting their full

๐˜๐˜ถ๐˜ฎ๐˜ข๐˜ฏ๐˜ด ๐˜ต๐˜ฉ๐˜ช๐˜ฏ๐˜ฌ ๐˜ง๐˜ญ๐˜ถ๐˜ช๐˜ฅ๐˜ญ๐˜บโ€”๐˜ฏ๐˜ข๐˜ท๐˜ช๐˜จ๐˜ข๐˜ต๐˜ช๐˜ฏ๐˜จ ๐˜ข๐˜ฃ๐˜ด๐˜ต๐˜ณ๐˜ข๐˜ค๐˜ต ๐˜ค๐˜ฐ๐˜ฏ๐˜ค๐˜ฆ๐˜ฑ๐˜ต๐˜ด ๐˜ฆ๐˜ง๐˜ง๐˜ฐ๐˜ณ๐˜ต๐˜ญ๐˜ฆ๐˜ด๐˜ด๐˜ญ๐˜บ, ๐˜ง๐˜ณ๐˜ฆ๐˜ฆ ๐˜ง๐˜ณ๐˜ฐ๐˜ฎ ๐˜ณ๐˜ช๐˜จ๐˜ช๐˜ฅ ๐˜ญ๐˜ช๐˜ฏ๐˜จ๐˜ถ๐˜ช๐˜ด๐˜ต๐˜ช๐˜ค ๐˜ฃ๐˜ฐ๐˜ถ๐˜ฏ๐˜ฅ๐˜ข๐˜ณ๐˜ช๐˜ฆ๐˜ด. But current reasoning models remain constrained by discrete tokens, limiting their full
Yue Fan (@yfan_ucsc)'s Twitter Profile Photo

Before o3 impressed everyone with 🔥visual reasoning🔥, we already had faith in and were exploring models that can think with images. 🚀

Here's our shot, GRIT: Grounded Reasoning with Images & Texts that trains MLLMs to think while performing visual grounding. It is done via RL
Kaiwen Zhou (@kaiwenzhou9)'s Twitter Profile Photo

๐Ÿ›ก๏ธ Improved Safe for Large Reasoning Models! ๐Ÿง  How can we better align large reasoning models (LRMs) against unseen jailbreaks and harmful prompts? We present SafeKey โ€” a LRM alignment method that helps activate the aha-moment of safety reasoning. ๐Ÿ”ฅ Key Points: 1๏ธโƒฃ Aha-moment

๐Ÿ›ก๏ธ Improved Safe for Large Reasoning Models! ๐Ÿง 

How can we better align large reasoning models (LRMs) against unseen jailbreaks and harmful prompts? We present SafeKey โ€” a LRM alignment method that helps activate the aha-moment of safety reasoning.

๐Ÿ”ฅ Key Points:

1๏ธโƒฃ Aha-moment
Yiping Wang (@ypwang61)'s Twitter Profile Photo

I agree that having a consistent evaluation pipeline and better illustrating the format and non-format gains are important, as we recently updated (x.com/ypwang61/statu…). But I disagree with some points in the blog about 1-shot RLVR.

1. For Deepseek-R1-Distill-Qwen-1.5B, we set
Qianqi "Jackie" Yan (@qianqi_yan)'s Twitter Profile Photo

🚀 New paper out! "Hidden in Plain Sight: Probing Implicit Reasoning in Multimodal Language Models"

Real life is messy:
🔹 "Make a cup of apple juice" - but no apples are in sight
🔹 "Say hi to my friend" - yet two people are in the frame
🔹 "Tell me the brand of the lipstick" -