Jihan Yao (@jihan_yao)'s Twitter Profile
Jihan Yao

@jihan_yao

PhD student @uwcse. Prev. @Tsinghua_Uni. Self-improvement of large foundation models.

ID: 1716308968715456512

Joined: 23-10-2023 04:22:07

15 Tweets

40 Followers

80 Following

Yifei Zhou (@yifeizhou02)'s Twitter Profile Photo

📢 New Preprint: Self-Challenging Agent (SCA) 📢

It’s costly to scale agent tasks with reliable verifiers.

In SCA, the key idea is to have another challenger explore the env and construct tasks along with verifiers.

Here is how it achieves 2x improvements on general
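A rough sketch of the loop this describes, under stated assumptions: a challenger LM explores the environment and proposes tasks paired with programmatic verifiers, and the executor agent is trained only on verified successes. `challenger`, `executor`, and `env` are hypothetical stand-ins, not the paper's actual API.

```python
def self_challenging_round(challenger, executor, env, n_tasks=8):
    """One SCA-style round: the challenger writes tasks + verifiers,
    the executor attempts them, and verified successes become training
    data. All three objects are hypothetical duck-typed stand-ins."""
    training_data = []
    for _ in range(n_tasks):
        # The challenger explores the env and emits a task together
        # with a cheap, checkable verifier function.
        task, verifier = challenger.propose(env)
        trajectory = executor.attempt(env, task)
        if verifier(trajectory):  # reliable check, no human labeling
            training_data.append((task, trajectory))
    executor.finetune(training_data)  # e.g., rejection fine-tuning
    return training_data
```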
Yushi Hu (@huyushi98)'s Twitter Profile Photo

When cooking multimodal models, one big headache I found is that the evaluation benchmarks are not reliable — especially for tasks like interleaved generation.😢 The authors put a big effort into reliability, crafting one eval pipeline for each task. 🔥For example, they even

Guang Yang (@guangyangnlp)'s Twitter Profile Photo

Audio & music evals in multimodal generation are tough—noisy metrics, vague correctness. 🎧😵‍💫 Our new work MMMG improves this with clear tasks, reliable metrics, and human-aligned judgments. 💯 If you’re working on audio/music in multimodal models, this benchmark is a must-try!

Yike Wang (@yikewang_)'s Twitter Profile Photo

LLMs are helpful for scientific research — but will they continuously be helpful?

Introducing 🔍ScienceMeter: current knowledge update methods enable 86% preservation of prior scientific knowledge, 72% acquisition of new knowledge, and 38%+ projection of future knowledge (arxiv.org/abs/2505.24302).
Banghua Zhu (@banghuaz)'s Twitter Profile Photo

Excited to share that I’m joining NVIDIA as a Principal Research Scientist!

We’ll be joining forces on efforts in model post-training, evaluation, agents, and building better AI infrastructure—with a strong emphasis on collaboration with developers and academia. We’re committed
Michael Hu (@michahu8)'s Twitter Profile Photo

📢 today's scaling laws often don't work for predicting downstream task performance. For some pretraining setups, smooth and predictable scaling is the exception, not the rule.

a quick read about scaling law fails: 
📜arxiv.org/abs/2507.00885

🧵1/5👇
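To make the failure mode concrete, here is a toy illustration (every number below is invented): fit the usual power-law form to small-model pretraining loss and extrapolate. Loss tends to extrapolate smoothly, while downstream accuracy can sit flat and then jump, which no smooth fit predicts.

```python
# Toy illustration of the failure mode above; all data points are made up.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # The usual scaling-law form: L(N) = a * N^(-b) + c
    return a * n ** -b + c

n = np.array([1e7, 3e7, 1e8, 3e8])        # small-model parameter counts
loss = np.array([4.1, 3.7, 3.3, 3.0])     # pretraining loss: smooth
acc = np.array([0.26, 0.25, 0.27, 0.41])  # downstream accuracy: flat, then a jump

popt, _ = curve_fit(power_law, n, loss, p0=[10.0, 0.1, 2.0], maxfev=10000)
print("extrapolated loss at 1e9 params:", power_law(1e9, *popt))
# Fitting the same form to `acc` would extrapolate the flat region and
# miss the jump -- exactly the unpredictability the thread describes.
```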
Zhiyuan Zeng (@zhiyuanzeng_)'s Twitter Profile Photo

EvalTree accepted to Conference on Language Modeling 2025 - my first PhD work and first COLM paper 🙌!

What would you like to see next—extensions, applications, or other directions? Always open to ideas! 🧐
Yanming Wan (@yanming_wan)'s Twitter Profile Photo

Personalization methods for LLMs often rely on extensive user history. We introduce Curiosity-driven User-modeling Reward as Intrinsic Objective (CURIO) to encourage actively learning about the user within multi-turn dialogs.
📜 arxiv.org/abs/2504.03206
🌎 sites.google.com/cs.washington.…
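One way to read "curiosity-driven user-modeling reward": the agent earns intrinsic reward when a dialog turn improves its model of the user. A hypothetical sketch, not CURIO's actual formulation; `user_model` and its `uncertainty` method are invented for illustration.

```python
def intrinsic_reward(user_model, history, new_turn):
    """Hypothetical curiosity-style reward: how much did this turn
    reduce our uncertainty about the user? Not the paper's exact form."""
    before = user_model.uncertainty(history)
    after = user_model.uncertainty(history + [new_turn])
    return max(0.0, before - after)  # positive only if we learned something
```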

Weijia Shi (@weijiashi2)'s Twitter Profile Photo

Can data owners & LM developers collaborate to build a strong shared model while each retains control of its data? Introducing FlexOlmo💪, a mixture-of-experts LM enabling:
• Flexible training on your local data without sharing it
• Flexible inference to opt in/out your data
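A toy sketch of the opt-in/out idea, assuming each expert in a mixture-of-experts layer is trained by one data owner and can be masked out at inference without retraining. This is an illustration of the concept, not FlexOlmo's implementation.

```python
import torch
import torch.nn as nn

class OptOutMoE(nn.Module):
    """Toy MoE layer where each expert belongs to one data owner and
    can be masked out at inference (hypothetical, for illustration)."""
    def __init__(self, dim, n_experts):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, x, active):
        # `active` is a bool mask over experts; at least one must be True.
        logits = self.router(x)
        logits[..., ~active] = float("-inf")  # opted-out experts get zero weight
        weights = logits.softmax(dim=-1)                      # (batch, E)
        outs = torch.stack([e(x) for e in self.experts], -1)  # (batch, dim, E)
        return (outs * weights.unsqueeze(-2)).sum(-1)         # (batch, dim)

layer = OptOutMoE(dim=16, n_experts=4)
x = torch.randn(2, 16)
active = torch.tensor([True, True, False, True])  # owner of expert 2 opts out
y = layer(x, active)                              # expert 2 contributes nothing
```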

Jihan Yao (@jihan_yao)'s Twitter Profile Photo

Very exciting work and an effective way to use low-quality data! This paper broadly extends our findings in “Varying Shades of Wrong” at #iclr2025. Feel free to check it out at: arxiv.org/abs/2410.11055

Oreva Ahia (@orevaahia)'s Twitter Profile Photo

🎉 We’re excited to introduce BLAB: Brutally Long Audio Bench, the first benchmark for evaluating long-form reasoning in audio LMs across 8 challenging tasks, using 833+ hours of Creative Commons audio. (avg length: 51 minutes).
Jihan Yao (@jihan_yao)'s Twitter Profile Photo

I’m excited to present MMMG at the Machine Learning for Audio Workshop #icml2025 this Saturday, 4:00-4:20 p.m. Would love to see you there and hear your thoughts!

Bingbing Wen (@bingbingwen1)'s Twitter Profile Photo

I'll be in Vienna all week for #ACL2025! Excited to present our work on abstention and overconfidence. If you're interested in discussing bidirectional reliability in data and models, I'd love to connect!
papers: direct.mit.edu/tacl/article/d…
aclanthology.org/2025.findings-…
Shangbin Feng (@shangbinfeng)'s Twitter Profile Photo

👀 How to find more difficult/novel/salient evaluation data?
✨ Let the data generators find it for you!

Introducing Data Swarms, multiple data generator LMs collaboratively search in the weight space to optimize quantitative desiderata of evaluation.
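A minimal sketch of "search in the weight space", read here as a particle-swarm-style update over flattened generator weights, scored by a quantitative desideratum (e.g., how difficult the generated eval data is). This is a hypothetical illustration of that family of methods, not the paper's algorithm.

```python
import numpy as np

def swarm_search(init_particles, score, steps=50, w=0.7, c1=1.5, c2=1.5):
    """Particle-swarm-style search over flattened model weights.
    init_particles: (P, D) array, one row per generator LM's weights.
    score: maps a weight vector to the desideratum being maximized
    (e.g., difficulty of the eval data that generator produces)."""
    particles = init_particles.copy()
    vel = np.zeros_like(particles)
    pbest = particles.copy()
    pbest_s = np.array([score(p) for p in particles])
    gbest = pbest[pbest_s.argmax()].copy()
    for _ in range(steps):
        r1, r2 = np.random.rand(2)
        # Pull each particle toward its own best and the swarm's best.
        vel = w * vel + c1 * r1 * (pbest - particles) + c2 * r2 * (gbest - particles)
        particles += vel
        s = np.array([score(p) for p in particles])
        improved = s > pbest_s
        pbest[improved], pbest_s[improved] = particles[improved], s[improved]
        gbest = pbest[pbest_s.argmax()].copy()
    return gbest  # best weight vector found
```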