Fangyuan Xu (@brunchavecmoi) 's Twitter Profile
Fangyuan Xu

@brunchavecmoi

许方园👩🏻‍💻 PhD student @ NYU, interested in natural language processing
also via fangyuanxu.🦋.social

ID: 1159484728359043072

Link: https://carriex.github.io/ · Joined: 08-08-2019 15:21:16

134 Tweets

453 Followers

570 Following

Anuj Diwan (@anuj_diwan) 's Twitter Profile Photo

Introducing ParaSpeechCaps, our large-scale style captions dataset that enables rich, expressive control for text-to-speech models!

Beyond basic pitch or speed controls, our models can generate speech that sounds "guttural", "scared", "whispered" and more; 59 style tags in total.

Yixiao Song (@yixiao_song) 's Twitter Profile Photo

Introducing 🐻 BEARCUBS 🐻, a “small but mighty” dataset of 111 QA pairs designed to assess computer-using web agents in multimodal interactions on the live web!

✅ Humans achieve 85% accuracy
❌ OpenAI Operator: 24%
❌ Anthropic Computer Use: 14%
❌ Convergence AI Proxy: 13%

Amanda Bertsch (@abertsch72) 's Twitter Profile Photo

Emily, Chin-Jou, and Yilin did some careful accounting and clever design to produce DBSA, a many-shot ICL setting that beats fine-tuning on accuracy *and* amortized cost, even when running 100,000+ queries!
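The amortized-cost comparison behind this claim can be sketched with purely illustrative numbers (none are from the paper): fine-tuning pays a large one-time training cost but serves cheap queries, while many-shot ICL pays nothing up front but more per query, so the winner depends on query volume.

```python
def amortized_cost_per_query(one_time_cost, per_query_cost, n_queries):
    """Total cost spread across all queries served."""
    return (one_time_cost + per_query_cost * n_queries) / n_queries

# Hypothetical dollar figures, for illustration only:
# fine-tuning: $500 one-time training, cheap short-prompt queries;
# many-shot ICL: no training, pricier long-prompt queries.
ft = amortized_cost_per_query(one_time_cost=500.0, per_query_cost=0.001, n_queries=100_000)
icl = amortized_cost_per_query(one_time_cost=0.0, per_query_cost=0.004, n_queries=100_000)
```

With these made-up numbers ICL stays cheaper per query even at 100,000 queries; shrinking the per-query gap (e.g. via prompt caching) pushes the break-even point further out.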

Zhaofeng Wu @ ICLR (@zhaofeng_wu) 's Twitter Profile Photo

Robust reward models are critical for alignment/inference-time algos, auto eval, etc. (e.g. to prevent reward hacking which could render alignment ineffective). ⚠️ But we found that SOTA RMs are brittle 🫧 and easily flip predictions when the inputs are slightly transformed 🍃 🧵

Ramya Namuduri (@ramya_namuduri) 's Twitter Profile Photo

Have that eerie feeling of déjà vu when reading model-generated text 👀, but can’t pinpoint the specific words or phrases 👀?

✨We introduce QUDsim to quantify discourse similarities beyond lexical, syntactic, and content overlap.

Manya Wadhwa (@manyawadhwa1) 's Twitter Profile Photo

Evaluating language model responses on open-ended tasks is hard! 🤔 We introduce EvalAgent, a framework that identifies nuanced and diverse criteria 📋✍️. EvalAgent identifies 👩‍🏫🎓 expert advice on the web that implicitly addresses the user’s prompt 🧵👇

Vishakh Padmakumar (@vishakh_pk) 's Twitter Profile Photo

What does it mean for #LLM output to be novel?

In work w/ John (Yueh-Han) Chen, Jane Pan, Valerie Chen, and He He, we argue it needs to be both original and high quality. While prompting tricks trade one for the other, better models (scaling/post-training) can shift the novelty frontier 🧵
Yapei Chang (@yapeichang) 's Twitter Profile Photo

🤔 Can simple string-matching metrics like BLEU rival reward models for LLM alignment? 🔍 We show that given access to a reference, BLEU can match reward models in human preference agreement, and even train LLMs competitively with them using GRPO. 🫐 Introducing BLEUBERI:
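The core idea of using a reference-based string-matching metric as a reward can be sketched from scratch (a toy sentence-level BLEU, not the paper's exact metric or its GRPO training loop):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_reward(candidate, reference, max_n=4):
    """Toy sentence-level BLEU with add-one smoothing and a brevity penalty.
    An illustrative sketch only, not the implementation used in BLEUBERI."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # add-one smoothing so one empty n-gram order doesn't zero the score
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # brevity penalty: punish candidates much shorter than the reference
    bp = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat"
close = bleu_reward("the cat sat on a mat", reference)
far = bleu_reward("dogs run fast", reference)
```

A candidate closer to the reference gets the higher reward, which is all a preference-style training signal needs.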

Sebastian Joseph (@sebajoed) 's Twitter Profile Photo

How good are LLMs at 🔭 scientific computing and visualization 🔭?

AstroVisBench tests how well LLMs implement scientific workflows in astronomy and visualize results.

SOTA models like Gemini 2.5 Pro & Claude 4 Opus only match ground truth scientific utility 16% of the time. 🧵
Fangcong Yin (@fangcong_y10593) 's Twitter Profile Photo

Solving complex problems with CoT requires combining different skills.

We can do this by:
🧩Modifying the CoT data format to be “composable” with other skills
🔥Training models on each skill
📌Combining those models

This leads to better 0-shot reasoning on tasks involving skill composition!

Chau Minh Pham (@chautmpham) 's Twitter Profile Photo

🤔 What if you gave an LLM thousands of random human-written paragraphs and told it to write something new -- while copying 90% of its output from those texts?

🧟 You get what we call a Frankentext!

💡 Frankentexts are surprisingly coherent and tough for AI detectors to flag.

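One rough, hypothetical proxy for checking the "copy 90%" constraint (not the paper's actual metric) is to measure what fraction of the output's n-grams appear verbatim in the provided source paragraphs:

```python
def copy_fraction(output, sources, n=5):
    """Fraction of the output's word n-grams found verbatim in any source
    paragraph -- a crude, illustrative proxy for 'how much was copied'."""
    src_ngrams = set()
    for s in sources:
        toks = s.split()
        src_ngrams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    out = output.split()
    out_ngrams = [tuple(out[i:i + n]) for i in range(len(out) - n + 1)]
    if not out_ngrams:
        return 0.0
    return sum(g in src_ngrams for g in out_ngrams) / len(out_ngrams)

sources = ["the quick brown fox jumps over the lazy dog"]
fully_copied = copy_fraction("the quick brown fox jumps over the lazy dog", sources)
fully_novel = copy_fraction("a completely different sentence written from scratch", sources)
```

Longer n makes the check stricter: only contiguous spans of at least n words count as copied.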
Xi Ye (@xiye_nlp) 's Twitter Profile Photo

🤔 Recent mech interp work showed that retrieval heads can explain some long-context behavior. But can we use this insight for retrieval?

📣 Introducing QRHeads (query-focused retrieval heads) that enhance retrieval

Main contributions:
🔍 Better head detection: we find a
hyunji amy lee (@hyunji_amy_lee) 's Twitter Profile Photo

🚨 Want models to better utilize and ground on the provided knowledge? We introduce Context-INformed Grounding Supervision (CINGS)! Training LLMs with CINGS significantly boosts grounding abilities in both text and vision-language models compared to standard instruction tuning.

CLS (@chengleisi) 's Twitter Profile Photo

Are AI scientists already better than human researchers?

We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts.

Main finding: LLM ideas result in worse projects than human ideas.

hyunji amy lee (@hyunji_amy_lee) 's Twitter Profile Photo

🥳Excited to share that I’ll be joining UNC Computer Science as a postdoc this fall. Looking forward to working with Mohit Bansal & amazing students at UNC AI.

I'll continue working on retrieval, aligning knowledge modules with LLM's parametric knowledge, and expanding to various modalities.

Joongwon Kim (@danieljwkim) 's Twitter Profile Photo

Can we improve Llama 3’s reasoning abilities through post-training only? Introducing ASTRO, our new framework that teaches LLMs to perform in-context search and generate long CoT to solve math problems, via SFT and RL. Work done at @aiatmeta. 📄 Paper: arxiv.org/abs/2507.00417

Weijia Shi (@weijiashi2) 's Twitter Profile Photo

Can data owners & LM developers collaborate to build a strong shared model while each retaining data control?

Introducing FlexOlmo💪, a mixture-of-experts LM enabling:
• Flexible training on your local data without sharing it
• Flexible inference to opt in/out your data

Orion Weller @ ICLR 2025 (@orionweller) 's Twitter Profile Photo

🤔 Have you ever wondered how good ModernBERT is compared to decoders like Llama?

We made an open-data version of ModernBERT and used the same recipe for encoders and decoders.

Turns out, our encoder model beats ModernBERT and our decoder model beats Llama 3.2 / SmolLM2 🤯

🧵