Fangyuan Xu (@brunchavecmoi) 's Twitter Profile
Fangyuan Xu

@brunchavecmoi

许方园👩🏻‍💻 PhD student @ NYU, interested in natural language processing
also via fangyuanxu.🦋.social

ID: 1159484728359043072

Link: https://carriex.github.io/ · Joined: 08-08-2019 15:21:16

134 Tweets

453 Followers

570 Following

Anuj Diwan (@anuj_diwan) 's Twitter Profile Photo

Introducing ParaSpeechCaps, our large-scale style captions dataset that enables rich, expressive control for text-to-speech models!

Beyond basic pitch or speed controls, our models can generate speech that sounds "guttural", "scared", "whispered" and more; 59 style tags in total.

Yixiao Song (@yixiao_song) 's Twitter Profile Photo

Introducing 🐻 BEARCUBS 🐻, a “small but mighty” dataset of 111 QA pairs designed to assess computer-using web agents in multimodal interactions on the live web!

✅ Humans achieve 85% accuracy
❌ OpenAI Operator: 24%
❌ Anthropic Computer Use: 14%
❌ Convergence AI Proxy: 13%

Amanda Bertsch (@abertsch72) 's Twitter Profile Photo

Emily, Chin-Jou, and Yilin did some careful accounting and clever design to produce DBSA, a many-shot ICL setting that beats fine-tuning on accuracy *and* amortized cost, even when running 100,000+ queries!
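The amortized-cost comparison behind this claim can be sketched with purely illustrative numbers (none are from the paper): fine-tuning pays a large one-time training cost but serves cheap queries, while many-shot ICL pays nothing up front but more per query, so the winner depends on query volume.

```python
def amortized_cost_per_query(one_time_cost, per_query_cost, n_queries):
    """Total cost spread across all queries served."""
    return (one_time_cost + per_query_cost * n_queries) / n_queries

# Hypothetical dollar figures, for illustration only:
# fine-tuning: $500 one-time training, cheap short-prompt queries;
# many-shot ICL: no training, pricier long-prompt queries.
ft = amortized_cost_per_query(one_time_cost=500.0, per_query_cost=0.001, n_queries=100_000)
icl = amortized_cost_per_query(one_time_cost=0.0, per_query_cost=0.004, n_queries=100_000)
```

With these made-up numbers ICL stays cheaper per query even at 100,000 queries; shrinking the per-query gap (e.g. via prompt caching) pushes the break-even point further out.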

Zhaofeng Wu @ ICLR (@zhaofeng_wu) 's Twitter Profile Photo

Robust reward models are critical for alignment/inference-time algos, auto eval, etc. (e.g. to prevent reward hacking which could render alignment ineffective). ⚠️ But we found that SOTA RMs are brittle 🫧 and easily flip predictions when the inputs are slightly transformed 🍃 🧵

Ramya Namuduri (@ramya_namuduri) 's Twitter Profile Photo

Have that eerie feeling of déjà vu when reading model-generated text 👀, but can’t pinpoint the specific words or phrases 👀?

✨We introduce QUDsim to quantify discourse similarities beyond lexical, syntactic, and content overlap.

Manya Wadhwa (@manyawadhwa1) 's Twitter Profile Photo

Evaluating language model responses on open-ended tasks is hard! 🤔 We introduce EvalAgent, a framework that identifies nuanced and diverse criteria 📋✍️. EvalAgent identifies 👩‍🏫🎓 expert advice on the web that implicitly addresses the user’s prompt 🧵👇

Vishakh Padmakumar (@vishakh_pk) 's Twitter Profile Photo

What does it mean for #LLM output to be novel?

In work w/ John (Yueh-Han) Chen, Jane Pan, Valerie Chen, and He He, we argue it needs to be both original and high quality. While prompting tricks trade one for the other, better models (scaling/post-training) can shift the novelty frontier 🧵
Yapei Chang (@yapeichang) 's Twitter Profile Photo

🤔 Can simple string-matching metrics like BLEU rival reward models for LLM alignment? 🔍 We show that given access to a reference, BLEU can match reward models in human preference agreement, and even train LLMs competitively with them using GRPO. 🫐 Introducing BLEUBERI:
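The core idea of using a reference-based string-matching metric as a reward can be sketched from scratch (a toy sentence-level BLEU, not the paper's exact metric or its GRPO training loop):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_reward(candidate, reference, max_n=4):
    """Toy sentence-level BLEU with add-one smoothing and a brevity penalty.
    An illustrative sketch only, not the implementation used in BLEUBERI."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # add-one smoothing so one empty n-gram order doesn't zero the score
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # brevity penalty: punish candidates much shorter than the reference
    bp = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat"
close = bleu_reward("the cat sat on a mat", reference)
far = bleu_reward("dogs run fast", reference)
```

A candidate closer to the reference gets the higher reward, which is all a preference-style training signal needs.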

Sebastian Joseph (@sebajoed) 's Twitter Profile Photo

How good are LLMs at 🔭 scientific computing and visualization 🔭?

AstroVisBench tests how well LLMs implement scientific workflows in astronomy and visualize results.

SOTA models like Gemini 2.5 Pro & Claude 4 Opus only match ground truth scientific utility 16% of the time. 🧵
Fangcong Yin (@fangcong_y10593) 's Twitter Profile Photo

Solving complex problems with CoT requires combining different skills.

We can do this by:
🧩Modifying the CoT data format to be “composable” with other skills
🔥Training models on each skill
📌Combining those models

This leads to better 0-shot reasoning on tasks involving skill composition!

Chau Minh Pham (@chautmpham) 's Twitter Profile Photo

🤔 What if you gave an LLM thousands of random human-written paragraphs and told it to write something new -- while copying 90% of its output from those texts?

🧟 You get what we call a Frankentext!

💡 Frankentexts are surprisingly coherent and tough for AI detectors to flag.

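One rough, hypothetical proxy for checking the "copy 90%" constraint (not the paper's actual metric) is to measure what fraction of the output's n-grams appear verbatim in the provided source paragraphs:

```python
def copy_fraction(output, sources, n=5):
    """Fraction of the output's word n-grams found verbatim in any source
    paragraph -- a crude, illustrative proxy for 'how much was copied'."""
    src_ngrams = set()
    for s in sources:
        toks = s.split()
        src_ngrams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    out = output.split()
    out_ngrams = [tuple(out[i:i + n]) for i in range(len(out) - n + 1)]
    if not out_ngrams:
        return 0.0
    return sum(g in src_ngrams for g in out_ngrams) / len(out_ngrams)

sources = ["the quick brown fox jumps over the lazy dog"]
fully_copied = copy_fraction("the quick brown fox jumps over the lazy dog", sources)
fully_novel = copy_fraction("a completely different sentence written from scratch", sources)
```

Longer n makes the check stricter: only contiguous spans of at least n words count as copied.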
Xi Ye (@xiye_nlp) 's Twitter Profile Photo

🤔 Recent mech interp work showed that retrieval heads can explain some long-context behavior. But can we use this insight for retrieval?

📣 Introducing QRHeads (query-focused retrieval heads) that enhance retrieval

Main contributions:
🔍 Better head detection: we find a
hyunji amy lee (@hyunji_amy_lee) 's Twitter Profile Photo

🚨 Want models to better utilize and ground on the provided knowledge? We introduce Context-INformed Grounding Supervision (CINGS)! Training LLMs with CINGS significantly boosts grounding abilities in both text and vision-language models compared to standard instruction tuning.

CLS (@chengleisi) 's Twitter Profile Photo

Are AI scientists already better than human researchers?

We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts.

Main finding: LLM ideas result in worse projects than human ideas.

hyunji amy lee (@hyunji_amy_lee) 's Twitter Profile Photo

🥳Excited to share that I’ll be joining UNC Computer Science as a postdoc this fall. Looking forward to working with Mohit Bansal & amazing students at UNC AI.

I'll continue working on retrieval, aligning knowledge modules with LLM's parametric knowledge, and expanding to various modalities.

Joongwon Kim (@danieljwkim) 's Twitter Profile Photo

Can we improve Llama 3’s reasoning abilities through post-training only? Introducing ASTRO, our new framework that teaches LLMs to perform in-context search and generate long CoT to solve math problems, via SFT and RL. Work done at @aiatmeta. 📄 Paper: arxiv.org/abs/2507.00417

Weijia Shi (@weijiashi2) 's Twitter Profile Photo

Can data owners & LM developers collaborate to build a strong shared model while each retaining data control?

Introducing FlexOlmo💪, a mixture-of-experts LM enabling:
• Flexible training on your local data without sharing it
• Flexible inference to opt in/out your data

Orion Weller @ ICLR 2025 (@orionweller) 's Twitter Profile Photo

🤔 Have you ever wondered how good ModernBERT is compared to decoders like Llama?

We made an open-data version of ModernBERT and used the same recipe for encoders and decoders.

Turns out, our encoder model beats ModernBERT and our decoder model beats Llama 3.2 / SmolLM2 🤯

🧵