Jihan Yao (@jihan_yao)'s Twitter Profile
Jihan Yao

@jihan_yao

PhD student @uwcse. Prev. @Tsinghua_Uni. Self-improvement of large foundation models.

ID: 1716308968715456512

Joined: 23-10-2023 04:22:07

15 Tweets

40 Followers

80 Following

Yifei Zhou (@yifeizhou02)'s Twitter Profile Photo

📢 New Preprint: Self-Challenging Agent (SCA) 📢

It’s costly to scale agent tasks with reliable verifiers.

In SCA, the key idea is to have another challenger explore the env and construct tasks along with verifiers.

Here is how it achieves 2x improvements on general
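A rough sketch of the loop this describes, under stated assumptions: a challenger LM explores the environment and proposes tasks paired with programmatic verifiers, and the executor agent is trained only on verified successes. `challenger`, `executor`, and `env` are hypothetical stand-ins, not the paper's actual API.

```python
def self_challenging_round(challenger, executor, env, n_tasks=8):
    """One SCA-style round: the challenger writes tasks + verifiers,
    the executor attempts them, and verified successes become training
    data. All three objects are hypothetical duck-typed stand-ins."""
    training_data = []
    for _ in range(n_tasks):
        # The challenger explores the env and emits a task together
        # with a cheap, checkable verifier function.
        task, verifier = challenger.propose(env)
        trajectory = executor.attempt(env, task)
        if verifier(trajectory):  # reliable check, no human labeling
            training_data.append((task, trajectory))
    executor.finetune(training_data)  # e.g., rejection fine-tuning
    return training_data
```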
Yushi Hu (@huyushi98)'s Twitter Profile Photo

When cooking multimodal models, one big headache I found is that the evaluation benchmarks are not reliable — especially for tasks like interleaved generation.😢 The authors put a big effort into reliability, crafting one eval pipeline for each task. 🔥For example, they even

Guang Yang (@guangyangnlp)'s Twitter Profile Photo

Audio & music evals in multimodal generation are tough—noisy metrics, vague correctness. 🎧😵‍💫 Our new work MMMG improves this with clear tasks, reliable metrics, and human-aligned judgments. 💯 If you’re working on audio/music in multimodal models, this benchmark is a must-try!

Yike Wang (@yikewang_)'s Twitter Profile Photo

LLMs are helpful for scientific research — but will they continuously be helpful?

Introducing 🔍ScienceMeter: current knowledge update methods enable 86% preservation of prior scientific knowledge, 72% acquisition of new knowledge, and 38%+ projection of future knowledge (arxiv.org/abs/2505.24302).
Banghua Zhu (@banghuaz)'s Twitter Profile Photo

Excited to share that I’m joining NVIDIA as a Principal Research Scientist!

We’ll be joining forces on efforts in model post-training, evaluation, agents, and building better AI infrastructure—with a strong emphasis on collaboration with developers and academia. We’re committed
Michael Hu (@michahu8)'s Twitter Profile Photo

📢 today's scaling laws often don't work for predicting downstream task performance. For some pretraining setups, smooth and predictable scaling is the exception, not the rule.

a quick read about scaling law fails: 
📜arxiv.org/abs/2507.00885

🧵1/5👇
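To make the failure mode concrete, here is a toy illustration (every number below is invented): fit the usual power-law form to small-model pretraining loss and extrapolate. Loss tends to extrapolate smoothly, while downstream accuracy can sit flat and then jump, which no smooth fit predicts.

```python
# Toy illustration of the failure mode above; all data points are made up.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # The usual scaling-law form: L(N) = a * N^(-b) + c
    return a * n ** -b + c

n = np.array([1e7, 3e7, 1e8, 3e8])        # small-model parameter counts
loss = np.array([4.1, 3.7, 3.3, 3.0])     # pretraining loss: smooth
acc = np.array([0.26, 0.25, 0.27, 0.41])  # downstream accuracy: flat, then a jump

popt, _ = curve_fit(power_law, n, loss, p0=[10.0, 0.1, 2.0], maxfev=10000)
print("extrapolated loss at 1e9 params:", power_law(1e9, *popt))
# Fitting the same form to `acc` would extrapolate the flat region and
# miss the jump -- exactly the unpredictability the thread describes.
```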
Zhiyuan Zeng (@zhiyuanzeng_)'s Twitter Profile Photo

EvalTree accepted to Conference on Language Modeling 2025 - my first PhD work and first COLM paper 🙌!

What would you like to see next—extensions, applications, or other directions? Always open to ideas! 🧐
Yanming Wan (@yanming_wan)'s Twitter Profile Photo

Personalization methods for LLMs often rely on extensive user history. We introduce Curiosity-driven User-modeling Reward as Intrinsic Objective (CURIO) to encourage actively learning about the user within multi-turn dialogs.
📜 arxiv.org/abs/2504.03206
🌎 sites.google.com/cs.washington.…
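One way to read "curiosity-driven user-modeling reward": the agent earns intrinsic reward when a dialog turn improves its model of the user. A hypothetical sketch, not CURIO's actual formulation; `user_model` and its `uncertainty` method are invented for illustration.

```python
def intrinsic_reward(user_model, history, new_turn):
    """Hypothetical curiosity-style reward: how much did this turn
    reduce our uncertainty about the user? Not the paper's exact form."""
    before = user_model.uncertainty(history)
    after = user_model.uncertainty(history + [new_turn])
    return max(0.0, before - after)  # positive only if we learned something
```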

Weijia Shi (@weijiashi2)'s Twitter Profile Photo

Can data owners & LM developers collaborate to build a strong shared model while each retains control of its data? Introducing FlexOlmo💪, a mixture-of-experts LM enabling:
• Flexible training on your local data without sharing it
• Flexible inference to opt in/out your data
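A toy sketch of the opt-in/out idea, assuming each expert in a mixture-of-experts layer is trained by one data owner and can be masked out at inference without retraining. This is an illustration of the concept, not FlexOlmo's implementation.

```python
import torch
import torch.nn as nn

class OptOutMoE(nn.Module):
    """Toy MoE layer where each expert belongs to one data owner and
    can be masked out at inference (hypothetical, for illustration)."""
    def __init__(self, dim, n_experts):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, x, active):
        # `active` is a bool mask over experts; at least one must be True.
        logits = self.router(x)
        logits[..., ~active] = float("-inf")  # opted-out experts get zero weight
        weights = logits.softmax(dim=-1)                      # (batch, E)
        outs = torch.stack([e(x) for e in self.experts], -1)  # (batch, dim, E)
        return (outs * weights.unsqueeze(-2)).sum(-1)         # (batch, dim)

layer = OptOutMoE(dim=16, n_experts=4)
x = torch.randn(2, 16)
active = torch.tensor([True, True, False, True])  # owner of expert 2 opts out
y = layer(x, active)                              # expert 2 contributes nothing
```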

Jihan Yao (@jihan_yao)'s Twitter Profile Photo

Very exciting work and an effective way to use low-quality data! This paper broadly extends our findings in “Varying Shades of Wrong” at #iclr2025. Feel free to check it out at: arxiv.org/abs/2410.11055

Oreva Ahia (@orevaahia)'s Twitter Profile Photo

🎉 We’re excited to introduce BLAB: Brutally Long Audio Bench, the first benchmark for evaluating long-form reasoning in audio LMs across 8 challenging tasks, using 833+ hours of Creative Commons audio. (avg length: 51 minutes).
Jihan Yao (@jihan_yao)'s Twitter Profile Photo

I’m excited to present MMMG at the Machine Learning for Audio Workshop #icml2025 this Saturday, 4:00-4:20 p.m. Would love to see you there and hear your thoughts!

Bingbing Wen (@bingbingwen1)'s Twitter Profile Photo

I'll be in Vienna all week for #ACL2025! Excited to present our work on abstention and overconfidence. If you're interested in discussing bidirectional reliability in data and models, I'd love to connect!
papers: direct.mit.edu/tacl/article/d…
aclanthology.org/2025.findings-…
Shangbin Feng (@shangbinfeng)'s Twitter Profile Photo

👀 How to find more difficult/novel/salient evaluation data?
✨ Let the data generators find it for you!

Introducing Data Swarms, multiple data generator LMs collaboratively search in the weight space to optimize quantitative desiderata of evaluation.
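A minimal sketch of "search in the weight space", read here as a particle-swarm-style update over flattened generator weights, scored by a quantitative desideratum (e.g., how difficult the generated eval data is). This is a hypothetical illustration of that family of methods, not the paper's algorithm.

```python
import numpy as np

def swarm_search(init_particles, score, steps=50, w=0.7, c1=1.5, c2=1.5):
    """Particle-swarm-style search over flattened model weights.
    init_particles: (P, D) array, one row per generator LM's weights.
    score: maps a weight vector to the desideratum being maximized
    (e.g., difficulty of the eval data that generator produces)."""
    particles = init_particles.copy()
    vel = np.zeros_like(particles)
    pbest = particles.copy()
    pbest_s = np.array([score(p) for p in particles])
    gbest = pbest[pbest_s.argmax()].copy()
    for _ in range(steps):
        r1, r2 = np.random.rand(2)
        # Pull each particle toward its own best and the swarm's best.
        vel = w * vel + c1 * r1 * (pbest - particles) + c2 * r2 * (gbest - particles)
        particles += vel
        s = np.array([score(p) for p in particles])
        improved = s > pbest_s
        pbest[improved], pbest_s[improved] = particles[improved], s[improved]
        gbest = pbest[pbest_s.argmax()].copy()
    return gbest  # best weight vector found
```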