Xiangyu Qi (@xiangyuqi_pton) 's Twitter Profile
Xiangyu Qi

@xiangyuqi_pton

Research @openai | PhD @Princeton | Prev @GoogleAI @GoogleDeepMind

ID: 1210827897419681792

Link: http://xiangyuqi.com | Joined: 28-12-2019 07:40:50

827 Tweets

1.1K Followers

923 Following

OpenAI Developers (@openaidevs) 's Twitter Profile Photo

Remember reinforcement fine-tuning? We’ve been working away at it since last December, and it’s available today with OpenAI o4-mini! RFT uses chain-of-thought reasoning and task-specific grading to improve model performance—especially useful for complex domains. Take
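
As a rough sketch of the idea behind task-specific grading (a conceptual illustration, not OpenAI's RFT grader schema; the ICD-10 coding task and function name are hypothetical), a grader simply maps a model answer and a reference to a score the fine-tuning loop can optimize:

    # Hypothetical task-specific grader with partial credit.
    import re

    def grade_icd10(model_answer: str, reference_code: str) -> float:
        """Return a score in [0, 1] for a predicted ICD-10 code."""
        match = re.search(r"[A-Z]\d{2}(\.\d+)?", model_answer)
        if match is None:
            return 0.0
        predicted = match.group(0)
        if predicted == reference_code:
            return 1.0            # exact code
        if predicted[:3] == reference_code[:3]:
            return 0.5            # right category, wrong subcode
        return 0.0

    print(grade_icd10("Most likely code: E11.9", "E11.9"))  # 1.0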

OpenAI Developers (@openaidevs) 's Twitter Profile Photo

You can now connect GitHub repos to deep research in ChatGPT. 🐙 Ask a question and the deep research agent will read and search the repo’s source code and PRs, returning a detailed report with citations. Hit deep research → GitHub to get started.

Kenneth Li (@ke_li_2021) 's Twitter Profile Photo

🧵1/ Everyone says toxic data = bad models. But what if more toxic data could help us build less toxic models? Our new paper explores this paradox. Here’s what we found 👇

Peter Henderson (@peterhndrsn) 's Twitter Profile Photo

House Energy and Commerce reconciliation text has language preempting all state AI regulations: "no state or political subdivision may enforce any law or regulation regulating artificial intelligence models, artificial intelligence systems, or automated decision systems during

Princeton University (@princeton) 's Twitter Profile Photo

Princeton engineers have identified a universal weakness in AI chatbots that allows users to bypass safety guardrails and elicit directions for malicious uses, from creating nerve gas to hacking government databases. bit.ly/3SzRto7

Peter Henderson (@peterhndrsn) 's Twitter Profile Photo

There are so many hallucinated citations in court nowadays that I'm starting to put together a tracker. Check it out and feel free to send along any that I've missed. New tabs coming for more categories of AI+Law cases!

Flavio Adamo (@flavioad) 's Twitter Profile Photo

I asked Codex to convert a legacy project from Python 2.7 to 3.11 and from Django 1.x to 5.0. It literally took 12 minutes. If you know, that’s usually weeks of pain. This is actually insane.

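For a sense of why that's normally weeks of work, here's a toy before/after of the kind of mechanical change a Python 2.7 -> 3.11 migration touches everywhere (illustrative only, not from the project in the tweet):

    # Python 3.11 version of code that would have looked quite different under 2.7.

    def summarize(counts: dict[str, int]) -> None:
        # Python 2: counts.iteritems() and statement-form print without parentheses
        for name, n in counts.items():      # .iteritems() no longer exists in Python 3
            print(f"{name}: {n}")           # print is a function; f-strings since 3.6

    def mean(total: int, k: int) -> float:
        # Python 2: total / k was floor division for ints
        return total / k                    # true division in Python 3; use // for floor

    summarize({"users": 3, "posts": 12})
    print(mean(7, 2))  # 3.5, not 3
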
Anthony Peng (@realanthonypeng) 's Twitter Profile Photo

🚨 New work: We rethink how we finetune safer LLMs — not by filtering after the generation, but by tracking safety risk token by token during training. We repurpose guardrail models like 🛡️ Llama Guard and Granite Guardian to score evolving risk across each response 📉 — giving

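A minimal sketch of the idea as described (the guard_score function below is a stand-in for a guardrail model call, not the actual Llama Guard / Granite Guardian interface): score each growing prefix of a response, then use the per-token risk curve to reweight the fine-tuning loss.

    # Sketch: token-level risk tracking during fine-tuning (guard_score is a hypothetical
    # stand-in, not the real Llama Guard / Granite Guardian API).

    def guard_score(prefix_text: str) -> float:
        """Stand-in guardrail model: risk in [0, 1] for the text so far."""
        risky_terms = ("explosive", "bypass the filter")
        return 1.0 if any(t in prefix_text.lower() for t in risky_terms) else 0.0

    def token_risk_profile(response_tokens: list[str]) -> list[float]:
        """Score every prefix of the response, giving an evolving risk curve."""
        scores, prefix = [], ""
        for tok in response_tokens:
            prefix += tok
            scores.append(guard_score(prefix))
        return scores

    def reweight_loss(per_token_loss: list[float], risks: list[float]) -> float:
        """Down-weight (here: zero out) the loss on tokens flagged as risky."""
        return sum(l * (1.0 - r) for l, r in zip(per_token_loss, risks))

    tokens = ["Sure, ", "here is ", "how to ", "bypass the filter"]
    risks = token_risk_profile(tokens)
    print(risks)                                       # [0.0, 0.0, 0.0, 1.0]
    print(reweight_loss([0.5, 0.4, 0.6, 0.9], risks))  # 1.5: the risky token adds nothing
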
Xuandong Zhao (@xuandongzhao) 's Twitter Profile Photo

🚀 Excited to share the most inspiring work I’ve been part of this year: "Learning to Reason without External Rewards" TL;DR: We show that LLMs can learn complex reasoning without access to ground-truth answers, simply by optimizing their own internal sense of confidence. 1/n

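A minimal sketch of what an internal-confidence reward could look like, under my assumption (for illustration, not necessarily the paper's exact definition) that confidence means the model's next-token distributions are sharply peaked; the method plugs a signal like this into the RL loop in place of a verifier or ground-truth reward.

    # Sketch: the model's own confidence as an intrinsic reward (no ground-truth answers).
    # Confidence proxy = mean negative entropy of next-token distributions (an assumption).
    import math

    def confidence_reward(token_distributions: list[dict[str, float]]) -> float:
        """Higher when the model's next-token distributions are peaked (low entropy)."""
        neg_entropies = [
            sum(p * math.log(p) for p in dist.values() if p > 0)  # = -entropy
            for dist in token_distributions
        ]
        return sum(neg_entropies) / len(neg_entropies)

    confident = [{"4": 0.97, "5": 0.03}]   # sharply peaked distribution
    uncertain = [{"4": 0.50, "5": 0.50}]   # flat distribution
    print(confidence_reward(confident) > confidence_reward(uncertain))  # True
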
Pin-Yu Chen (@pinyuchentw) 's Twitter Profile Photo

Your LLM Guard Model is secretly a reliable LLM-finetuning-guardrail! IBM Granite Guardian and LLAMA Guard are particularly suited to tracking harmful levels of fine-tuning data at the token level and making training adjustments during fine-tuning. Paper: arxiv.org/abs/2505.17196

Stella Li (@stellalisy) 's Twitter Profile Photo

🤯 We cracked RLVR with... Random Rewards?!
Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵
Blogpost: tinyurl.com/spurious-rewar…

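For readers wondering what "random" or "incorrect" rewards mean concretely, here is a minimal sketch (mine, not the authors' code) of the reward functions being compared; any of them can be dropped into the same RLVR-style loop in place of the verifier.

    # Sketch of the compared reward signals (illustrative, not the authors' implementation).
    import random

    def ground_truth_reward(answer: str, reference: str) -> float:
        return 1.0 if answer.strip() == reference.strip() else 0.0

    def incorrect_reward(answer: str, reference: str) -> float:
        return 1.0 - ground_truth_reward(answer, reference)   # rewards only wrong answers

    def random_reward(answer: str, reference: str) -> float:
        return float(random.random() < 0.5)                    # ignores the answer entirely

    # The RL update itself is unchanged: reward_fn(model_answer, reference) feeds the
    # advantage estimate, and the policy is optimized as usual.
    print(ground_truth_reward("17", "17"), incorrect_reward("42", "17"), random_reward("42", "17"))
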
Peter Henderson (@peterhndrsn) 's Twitter Profile Photo

The next ~1-4 years will be taking the 2017-2020 years of Deep RL and scaling up: exploration, generalization, long-horizon tasks, credit assignment, continual learning, multi-agent interaction! Lots of cool work to be done! 🎮🤖 But we shouldn't forget big lessons from back

Peter Henderson (@peterhndrsn) 's Twitter Profile Photo

Our tracker of “hallucinated” or nonexistent citations in real-world legal contexts has reached over 140 cases across the world. There’s a notable spike in the last 6 months. 📈📈📈

Ahmad Beirami @ ICLR 2025 (@abeirami) 's Twitter Profile Photo

So far, RL in LLMs has been "RL as a distillation method". - RL helped us distill great verifiers (e.g., code) in the model. - When models are better at verification than generation, we used RL to distill those abilities back to the model. That's about to change with agents!