Yubo Wang (@yubowang726)'s Twitter Profile
Yubo Wang

@yubowang726

Incoming Ph.D. student in Computer Science @UWaterloo | MSc in CS @ucdavis | BSc in Computer Science @ZJU_China

ID: 1790613752733184000

Joined: 15-05-2024 05:22:53

19 Tweets

63 Followers

28 Following

Yubo Wang (@yubowang726)'s Twitter Profile Photo

🚀 THUDM just released GLM 4. Check out its impressive scores on the MMLU-Pro benchmark: (For more detailed results, visit huggingface.co/spaces/TIGER-L…)
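
For readers who want to reproduce this kind of comparison, here is a minimal sketch of scoring a model on MMLU-Pro via the Hugging Face datasets library. The split and field names ("question", "options", "answer") follow the TIGER-Lab/MMLU-Pro dataset card but should be verified against it, and predict() is a hypothetical stand-in for a real model call.

```python
from datasets import load_dataset

# Load the benchmark; field names are taken from the dataset card.
ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

def predict(prompt: str) -> str:
    """Hypothetical model call; should return an option letter like 'A'."""
    return "A"  # stub -- replace with a real LLM request

correct = 0
for row in ds:
    letters = [chr(ord("A") + i) for i in range(len(row["options"]))]
    prompt = row["question"] + "\n" + "\n".join(
        f"{letter}. {opt}" for letter, opt in zip(letters, row["options"])
    )
    correct += predict(prompt) == row["answer"]

print(f"accuracy: {correct / len(ds):.4f}")
```
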
Yubo Wang (@yubowang726)'s Twitter Profile Photo

🎉 Our paper MMLU-Pro has been selected for a spotlight at the 2024 NeurIPS D&B track! Huge thanks to all co-authors at Tiger-AI-Lab for their support and guidance! 🙏 We hope it can help in the evaluation of LLMs! #NeurIPS2024 #MMLUPro #TigerAILab

Aran Komatsuzaki (@arankomatsuzaki)'s Twitter Profile Photo

MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

505 realistic tasks encompassing over 8,000 samples from 16 expert annotators to extensively cover the multimodal task space

proj: tiger-ai-lab.github.io/MEGA-Bench/
abs: arxiv.org/abs/2410.10563
Yubo Wang (@yubowang726)'s Twitter Profile Photo

Excited to share our work on Critique Fine-Tuning (CFT), a new paradigm that teaches language models through critique rather than imitation. With just 50K examples and 8 GPU hours of training, we achieve performance comparable to or better than traditional SFT and RL approaches.
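
To make the paradigm concrete, here is a hedged sketch of what a single CFT training example could look like: the model is conditioned on a query plus a candidate (possibly wrong) solution and supervised to emit a critique instead of imitating a gold answer. The prompt template and field names are illustrative assumptions, not the paper's exact format.

```python
def build_cft_example(query: str, candidate: str, critique: str) -> dict:
    """Assemble one critique-supervision example (illustrative format)."""
    prompt = (
        f"Question: {query}\n"
        f"Candidate solution: {candidate}\n"
        "Critique the candidate solution, pointing out any errors:"
    )
    # Standard causal-LM convention: mask the prompt so the loss is
    # computed only on the critique tokens the model must generate.
    return {"prompt": prompt, "completion": " " + critique}

example = build_cft_example(
    query="What is 17 * 24?",
    candidate="17 * 24 = 398",
    critique="Incorrect: 17 * 24 = 408 (17*20 + 17*4 = 340 + 68).",
)
print(example["prompt"])
```
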

Yubo Wang (@yubowang726)'s Twitter Profile Photo

SuperGPQA breaks new ground: 285 disciplines, one massive AI test! It moves beyond basic math & coding to challenge LLMs in agriculture, industry & more. The top model hits only 61.82% - an incredible milestone in mapping AI capabilities!
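
Since the headline number aggregates over 285 disciplines, a per-discipline breakdown is the natural way to read such results. A small illustrative helper follows; the record layout ("discipline", "correct") is a placeholder, not SuperGPQA's actual schema.

```python
from collections import defaultdict

def per_discipline_accuracy(records: list[dict]) -> dict[str, float]:
    """Accuracy per discipline over scored prediction records."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["discipline"]] += 1
        hits[r["discipline"]] += int(r["correct"])
    return {d: hits[d] / totals[d] for d in totals}

records = [
    {"discipline": "agriculture", "correct": True},
    {"discipline": "agriculture", "correct": False},
    {"discipline": "industry", "correct": True},
]
print(per_discipline_accuracy(records))  # {'agriculture': 0.5, 'industry': 1.0}
```
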

Wenhu Chen (@wenhuchen)'s Twitter Profile Photo

🎬 Automated filmmaking is the future: you need dialogue, expressive talking heads, synchronized body motion, and multi-character interactions. 🚀 Today, in collaboration with AI at Meta, we're excited to introduce MoCha: Towards Movie-Grade Talking Character Synthesis 🔊

Wenhu Chen (@wenhuchen)'s Twitter Profile Photo

🚀 Introducing ScholarCopilot: a next-gen AI assistant designed specifically for professional academic writing! We have done a more in-depth evaluation and a human study showing that it significantly outperforms ChatGPT in citation accuracy. Our paper is online now:
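
As a rough illustration of the metric being compared, citation accuracy can be framed as the fraction of generated citations that resolve to real, correct references. The title-matching scheme below is an assumption made for the sketch, not ScholarCopilot's evaluation protocol.

```python
def citation_accuracy(cited_titles: list[str], ground_truth: set[str]) -> float:
    """Fraction of generated citations whose normalized title appears
    in the set of verified reference titles (illustrative metric)."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())

    truth = {norm(t) for t in ground_truth}
    hits = sum(norm(t) in truth for t in cited_titles)
    return hits / len(cited_titles) if cited_titles else 0.0

print(citation_accuracy(
    ["Attention Is All You Need", "A Made-Up Paper (2030)"],
    {"attention is all you need"},
))  # 0.5
```
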

Wenhu Chen (@wenhuchen)'s Twitter Profile Photo

🔥 How do you build a state-of-the-art Vision-Language Model with direct RL?

We’re excited to introduce VL-Rethinker, a new paradigm for multimodal reasoning trained directly with Reinforcement Learning.
📈 It sets new SOTA on key math+vision benchmarks:
- MathVista: 80.3 → 🥇
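
For context, direct-RL pipelines on verifiable math+vision tasks typically rely on a rule-checkable reward over the rollout's final answer. The sketch below shows that generic idea; it is not VL-Rethinker's actual reward implementation.

```python
import re

def extract_boxed(text: str) -> str | None:
    """Pull the final answer out of a LaTeX \\boxed{...} span, if any."""
    m = re.search(r"\\boxed\{([^}]*)\}", text)
    return m.group(1).strip() if m else None

def reward(rollout: str, reference: str) -> float:
    """Binary verifiable reward: 1.0 iff the boxed answer matches."""
    pred = extract_boxed(rollout)
    return 1.0 if pred is not None and pred == reference.strip() else 0.0

print(reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
```
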
Wenhu Chen (@wenhuchen)'s Twitter Profile Photo

🚀 General-Reasoner: Generalizing LLM Reasoning Across All Domains (Beyond Math)

Most recent RL/R1 works focus on math reasoning, but math-only tuning doesn't generalize to broader reasoning (e.g., performance drops on MMLU-Pro and SuperGPQA). Why are we limited to math reasoning?

1. Existing
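
One generic way to extend such RL beyond math, where answers rarely admit exact string matching, is to score rollouts with an LLM-based verifier that judges answer equivalence. The sketch below illustrates that idea; judge() is a hypothetical stand-in, not the paper's implementation.

```python
def judge(question: str, prediction: str, reference: str) -> bool:
    """Hypothetical verifier call; replace with a real LLM request."""
    prompt = (
        f"Question: {question}\nReference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Do these answers agree? Reply yes or no."
    )
    # ... send `prompt` to a small verifier model here ...
    # Fallback stub so the sketch runs: exact (case-insensitive) match.
    return prediction.strip().lower() == reference.strip().lower()

def reward(question: str, prediction: str, reference: str) -> float:
    """Binary RL reward derived from the verifier's judgment."""
    return 1.0 if judge(question, prediction, reference) else 0.0

print(reward("Capital of France?", "Paris", "paris"))  # 1.0
```
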
Ge Zhang (@gezhang86038849)'s Twitter Profile Photo

[1/5] 
💥 Facing the LLM Scaling Challenge Head-On! 💥

Glad to introduce MGA: Reformulation for Pretraining Data Augmentation! 

The AI world is grappling with data limitations and the performance hit from data repetition. 

We introduce MGA (Massive Genre-Audience)
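
The announcement suggests multiplying each source document into several genre-and-audience rewrites. Below is a minimal sketch of that loop; rewrite() is a hypothetical LLM call and the genre/audience lists are invented for illustration.

```python
from itertools import product

GENRES = ["textbook chapter", "Q&A dialogue", "news explainer"]
AUDIENCES = ["middle schoolers", "domain experts"]

def rewrite(doc: str, genre: str, audience: str) -> str:
    """Hypothetical rewriting call; replace with a real LLM request."""
    return f"[{genre} for {audience}] {doc}"  # placeholder output

def augment(doc: str) -> list[str]:
    """Produce one rewrite per (genre, audience) pair for a document."""
    return [rewrite(doc, g, a) for g, a in product(GENRES, AUDIENCES)]

variants = augment("Photosynthesis converts light energy into chemical energy.")
print(len(variants))  # 6 reformulations of one source document
```
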
Ge Zhang (@gezhang86038849)'s Twitter Profile Photo

[1/n]
🚨 Game On for LLM Reasoning—Meet KORGym! 🎮✨

Ever wondered how to truly assess an LLM’s reasoning ability beyond memorized knowledge? 

Meet our latest breakthrough: KORGym—a dynamic, multi-turn game platform built to reveal the real reasoning skills of language models!
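
A game-based evaluation of this kind boils down to a multi-turn loop: the environment emits feedback, the model chooses the next move, and the score reflects reasoning over turns rather than recall. The toy guessing game below illustrates the loop; it is not an actual KORGym task.

```python
import random

class NumberGuessEnv:
    """Toy multi-turn environment: guess a hidden number in [lo, hi]."""
    def __init__(self, lo: int = 1, hi: int = 64):
        self.lo, self.hi = lo, hi
        self.secret = random.randint(lo, hi)

    def step(self, guess: int) -> str:
        if guess == self.secret:
            return "correct"
        return "higher" if guess < self.secret else "lower"

def agent(lo: int, hi: int) -> int:
    # Binary search stands in for an LLM choosing its next move.
    return (lo + hi) // 2

env = NumberGuessEnv()
lo, hi = env.lo, env.hi
for turn in range(1, 8):  # 7 turns suffice for a 64-value range
    guess = agent(lo, hi)
    feedback = env.step(guess)
    if feedback == "correct":
        print(f"solved in {turn} turns")
        break
    lo, hi = (guess + 1, hi) if feedback == "higher" else (lo, guess - 1)
```
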
Wenhu Chen (@wenhuchen)'s Twitter Profile Photo

Our General-Reasoner paper is now out on arXiv: arxiv.org/abs/2505.14652
We have re-trained our General-Reasoner models and obtained much better performance!

- Our 4B General-Reasoner even beats NVIDIA's Nemotron-CrossThink-7B by a significant margin.
- Our 14B General-Reasoner