Qian Liu (@sivil_taram)'s Twitter Profile
Qian Liu

@sivil_taram

Researcher @ TikTok 🇸🇬

📄 Sailor / StarCoder / OpenCoder
💼 Past: Research Scientist @SeaAIL; PhD @MSFTResearch
🧠 Contribution: @XlangNLP @BigCodeProject

ID: 1465140087193161734

Link: http://siviltaram.github.io/
Joined: 29-11-2021 02:06:42

1.1K Tweets

3.3K Followers

674 Following

Ge Zhang (@gezhang86038849)

Is text-only information enough for LLM/VLM Web Agents? 🤔 Clearly not. 🙅‍♂️ The modern web is a rich tapestry of text, images 🖼️, and videos 🎥. To truly assist us, agents need to understand it all. That's why we built MM-BrowseComp. 🌐

We're introducing MM-BrowseComp 🚀, a new
Dynamics Lab (@dynamicslab_ai)

Introducing Mirage 2, a real-time, general-domain generative world engine you can play online. Upload any image (photos, concept art, classic paintings, kids' drawings) and step into it as a live, interactive world. Prompt your worlds with text to create any surreal scenes and

Jia Guo (@jia__guo)

🚀 Is more data always better for your RL training? Not sure how to pick the "right" data? Check out our latest research!

Ge Zhang (@gezhang86038849)

Although I won't be able to be onsite personally, I'm glad to announce that M-A-P is co-organizing a meetup with Monolith, alongside co-hosts from the verl, SGLang, Zilliz, and Creao AI dev teams, to explore the latest advances in RL, RL infrastructure, reasoning, and agentic AI in
Michael Qizhe Shieh (@mpulsewidth)

Introducing MCPMark, a collaboration with Eval Sys (@EvalSysOrg) and LobeHub (@lobehub)!

We created a challenging benchmark to stress-test MCP use in comprehensive contexts.
- 127 high-quality data samples created by experts.
- GPT-5 takes the current lead and achieves a Pass@1 of 46.96% while the
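
For context on the metric: Pass@1 is typically computed with the unbiased pass@k estimator and then averaged over tasks. The sketch below illustrates only that calculation; the per-task counts are invented for illustration, are not MCPMark numbers, and this is not claimed to be MCPMark's evaluation code.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples, drawn from n generations with c correct, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical per-task results as (total runs, correct runs) -- not MCPMark data.
task_results = [(4, 2), (4, 0), (4, 4), (4, 1)]
pass_at_1 = np.mean([pass_at_k(n, c, 1) for n, c in task_results])
print(f"Pass@1 = {pass_at_1:.2%}")  # 43.75% for this toy set
```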
Deep Learning For Code @ ICLR'25 (@dl4code)

🚨 FINAL CALL: Only 2 days left to submit to the Deep Learning for Code in the Agentic Era (DL4C) workshop at NeurIPS 2025!

🗓️ Deadline: Aug 27th, 11:59 PM UTC-12

Amazing speaker lineup including experts from CMU, UC Berkeley, Replit, poolside,
Yizhi Li (@yizhilll)

[1/n] Introducing TreePO 🌲, a new RL framework for LLMs! It slashes sampling costs while boosting reasoning capabilities. Daily Paper: huggingface.co/papers/2508.17…

AK (@_akhaliq)

TreePO

Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling
Qian Liu (@sivil_taram)

Thanks AK for sharing our work! 🔥 🧵 Back in Jan, when we had just started this project... we were living a nightmare 😩 Months of watching our multi-turn RL models collapse. Every. Single. Time. 💥 We thought we were doing something wrong... until we discovered other research

Junxian He (@junxian_he)

Mirage or method? We re-assess a series of RL observations such as spurious reward, one-shot RL, test-time RL, and negative-sample training. 

🧐 These approaches were all originally demonstrated on the Qwen+Math combination, but do they work in other settings? If not, under which
Yingru Li (@richardyrli)

The SimpleTIR paper is officially out! We go beyond our July blog post to provide a deeper mathematical explanation and rigorous proof for why multi-turn RL agents are so unstable. The root cause? A predictable domino effect: OOD Tool Feedback → Low-Prob Tokens → Exploding

Yu Su @#ICLR2025 (@ysu_nlp)

Computer Use: Modern Moravec's Paradox

A new blog post arguing why computer-use agents may be the biggest opportunity and challenge for AGI.

tinyurl.com/computer-use-a…

Table of Contents
> Moravec's Paradox
> Moravec's Paradox in 2025
> Computer use may be the biggest opportunity
Ge Zhang (@gezhang86038849)

🌀 "Do As I Say, Not As You Were Trained!" Problem solved? ❌ Not by today's LLMs.
We present Inverse IFEval: a new benchmark testing whether LLMs can follow counterintuitive instructions that deliberately break away from standard training patterns.
📊 Dataset:
Qian Liu (@sivil_taram)

🤔 Is your LLM actually listening to you? Or just parroting its training data? New research shows that LLMs might be WAY more "stubborn" than you think! Check out the thread for more details ⬇️

𝚐π”ͺ𝟾𝚑𝚑𝟾 (@gm8xx8) 's Twitter Profile Photo

Mini-o3: Reproducing OpenAI o3-style multi-turn visual reasoning. Unlike prior VLMs stuck at 1–2 turns, Mini-o3 executes deep tool-based reasoning spanning tens of steps. What it proves is that the right data, init, and an RL tweak unlock long-horizon visual search, without
Qian Liu (@sivil_taram)

Mini-o3 is trained with a 6-turn cap but naturally scales to 32 turns at inference, with accuracy improving the deeper it thinks. Great work from Xin Lai and the team!
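
As a toy illustration of the turn-budget idea only (not Mini-o3's actual training or inference code; the dummy policy and simulated tool feedback below are invented), this sketch shows the same rollout loop run under a small train-time cap versus a larger inference-time budget:

```python
import random

def dummy_policy(history: list) -> str:
    """Stand-in policy: each turn it either answers or requests one more tool call.
    Purely illustrative; not Mini-o3's model."""
    return "answer" if random.random() < 0.15 else "tool_call"

def rollout(max_turns: int) -> bool:
    """Run one episode under a turn budget; True if the agent answered in time."""
    history = []
    for turn in range(1, max_turns + 1):
        if dummy_policy(history) == "answer":
            return True
        history.append(f"simulated tool feedback at turn {turn}")
    return False  # turn budget exhausted before answering

random.seed(0)
for cap in (6, 32):  # train-time cap vs. larger inference-time budget
    solved = sum(rollout(cap) for _ in range(1_000))
    print(f"max_turns={cap:>2}: finished {solved / 10:.1f}% of episodes")
```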