Qian Liu (@sivil_taram)'s Twitter Profile
Qian Liu

@sivil_taram

Researcher @ TikTok 🇸🇬

📄 Sailor / StarCoder / OpenCoder
💼 Past: Research Scientist @SeaAIL; PhD @MSFTResearch
🧠 Contribution: @XlangNLP @BigCodeProject

ID: 1465140087193161734

Link: http://siviltaram.github.io/
Joined: 29-11-2021 02:06:42

1.1K Tweets

3.3K Followers

674 Following

Ge Zhang (@gezhang86038849)

Is text-only information enough for LLM/VLM Web Agents? 🤔 Clearly not. 🙅‍♂️ The modern web is a rich tapestry of text, images 🖼️, and videos 🎥. To truly assist us, agents need to understand it all. That's why we built MM-BrowseComp. 🌐

We're introducing MM-BrowseComp 🚀, a new
Dynamics Lab (@dynamicslab_ai)

Introducing Mirage 2, a real-time, general-domain generative world engine you can play online. Upload any image (photos, concept art, classic paintings, kids' drawings) and step into it as a live, interactive world. Prompt your worlds with text to create any surreal scenes and

Jia Guo (@jia__guo)

🚀 Is more data always better for your RL training? Not sure how to pick the "right" data? Check out our latest research!

Ge Zhang (@gezhang86038849)

Although I won't be able to be onsite personally, I'm glad to announce that M-A-P is co-organizing a meetup with Monolith, alongside co-hosts from the verl, SGLang, Zilliz, and Creao AI dev teams, to explore the latest advances in RL, RL infrastructure, reasoning, and agentic AI in
Michael Qizhe Shieh (@mpulsewidth)

Introducing MCPMark, a collaboration with Eval Sys (@EvalSysOrg) and LobeHub (@lobehub)!

We created a challenging benchmark to stress-test MCP use in comprehensive contexts.
- 127 high-quality data samples created by experts.
- GPT-5 takes the current lead and achieves a Pass@1 of 46.96% while the
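
For context on the metric: Pass@1 is typically computed with the unbiased pass@k estimator and then averaged over tasks. The sketch below illustrates only that calculation; the per-task counts are invented for illustration, are not MCPMark numbers, and this is not claimed to be MCPMark's evaluation code.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples, drawn from n generations with c correct, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical per-task results as (total runs, correct runs) -- not MCPMark data.
task_results = [(4, 2), (4, 0), (4, 4), (4, 1)]
pass_at_1 = np.mean([pass_at_k(n, c, 1) for n, c in task_results])
print(f"Pass@1 = {pass_at_1:.2%}")  # 43.75% for this toy set
```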
Deep Learning For Code @ ICLR'25 (@dl4code)

🚨 FINAL CALL: Only 2 days left to submit to the Deep Learning for Code in the Agentic Era (DL4C) workshop at NeurIPS 2025!

🗓️ Deadline: Aug 27th, 11:59 PM UTC-12

Amazing speaker lineup including experts from CMU, UC Berkeley, Replit, poolside,
Yizhi Li (@yizhilll)

[1/n] Introducing TreePO 🌲, a new RL framework for LLMs! It slashes sampling costs while boosting reasoning capabilities. Daily Paper: huggingface.co/papers/2508.17…

AK (@_akhaliq)

TreePO

Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling
Qian Liu (@sivil_taram)

Thanks AK for sharing our work! 🔥 🧵 Back in Jan, when we had just started this project... we were living a nightmare 😩 Months of watching our multi-turn RL models collapse. Every. Single. Time. 💥 We thought we were doing something wrong... until we discovered other research

Junxian He (@junxian_he)

Mirage or method? We re-assess a series of RL observations such as spurious reward, one-shot RL, test-time RL, and negative-sample training. 

🧐 These approaches were all originally demonstrated on the Qwen+Math combination, but do they work in other settings? If not, under which
Yingru Li (@richardyrli)

The SimpleTIR paper is officially out! We go beyond our July blog post to provide a deeper mathematical explanation and rigorous proof for why multi-turn RL agents are so unstable. The root cause? A predictable domino effect: OOD Tool Feedback → Low-Prob Tokens → Exploding

Yu Su @#ICLR2025 (@ysu_nlp)

Computer Use: Modern Moravec's Paradox

A new blog post arguing why computer-use agents may be the biggest opportunity and challenge for AGI.

tinyurl.com/computer-use-a…

Table of Contents
> Moravec's Paradox
> Moravec's Paradox in 2025
> Computer use may be the biggest opportunity
Ge Zhang (@gezhang86038849)

🌀 "Do As I Say, Not As You Were Trained!" Problem solved? ❌ Not by today's LLMs.
We present Inverse IFEval: a new benchmark testing whether LLMs can follow counterintuitive instructions that deliberately break away from standard training patterns.
📊 Dataset:
Qian Liu (@sivil_taram)

🤔 Is your LLM actually listening to you? Or just parroting its training data? New research shows that LLMs might be WAY more "stubborn" than you think! Check out the thread for more details ⬇️

𝚐π”ͺ𝟾𝚑𝚑𝟾 (@gm8xx8) 's Twitter Profile Photo

Mini-o3: Reproducing OpenAI o3-style multi-turn visual reasoning. Unlike prior VLMs stuck at 1–2 turns, Mini-o3 executes deep tool-based reasoning spanning tens of steps. What it proves is that the right data, init, and an RL tweak unlock long-horizon visual search, without
Qian Liu (@sivil_taram)

Mini-o3 is trained with a 6-turn cap but naturally scales to 32 turns at inference, with accuracy improving the deeper it thinks. Great work from Xin Lai and the team!
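
As a toy illustration of the turn-budget idea only (not Mini-o3's actual training or inference code; the dummy policy and simulated tool feedback below are invented), this sketch shows the same rollout loop run under a small train-time cap versus a larger inference-time budget:

```python
import random

def dummy_policy(history: list) -> str:
    """Stand-in policy: each turn it either answers or requests one more tool call.
    Purely illustrative; not Mini-o3's model."""
    return "answer" if random.random() < 0.15 else "tool_call"

def rollout(max_turns: int) -> bool:
    """Run one episode under a turn budget; True if the agent answered in time."""
    history = []
    for turn in range(1, max_turns + 1):
        if dummy_policy(history) == "answer":
            return True
        history.append(f"simulated tool feedback at turn {turn}")
    return False  # turn budget exhausted before answering

random.seed(0)
for cap in (6, 32):  # train-time cap vs. larger inference-time budget
    solved = sum(rollout(cap) for _ in range(1_000))
    print(f"max_turns={cap:>2}: finished {solved / 10:.1f}% of episodes")
```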