Terry Yue Zhuo (@terryyuezhuo)'s Twitter Profile
Terry Yue Zhuo

@terryyuezhuo

@BigComProject (non-profit, looking for compute sponsorship) Lead. IBM PhD fellow (2024-2025). Vibe Coding: swe-arena.com

ID: 1266233448521297926

Link: https://terryyz.github.io · Joined: 29-05-2020 05:02:42

1.1K Tweets

1.1K Followers

621 Following

meg.ai 🇨🇦 (@meganrisdal)'s Twitter Profile Photo

The NeurIPS Conference 2025 Datasets & Benchmarks track CFP is open! Good datasets are essential to progress in AI, so I'm thrilled to be a co-chair this year and, with my fellow colleagues, share some improvements to the submission and review process that raise the bar for standards of…

Xeophon (@thexeophon)'s Twitter Profile Photo

Super excited to share what I've been working on 👀

You know the struggle: You want to use a project, but it has a GPL or CC-by-NC license 😭

We worked hard and our AI-based agents convert any repo to an MIT-licensed one 🚀🚀🚀

Comment "Want" to get early access 👇🏼
Terry Yue Zhuo (@terryyuezhuo)'s Twitter Profile Photo

Llama-4 Series on BigCodeBench-Hard
*Inference via NVIDIA NIM

Llama-4 Maverick Ranked 41st/192
Similar to Gemini-2.0-Flash-Thinking & GPT-4o-2024-05-13
29.1% Complete
25% Instruct

Llama-4-Scout Ranked 97th/192
16.9% Complete
16.9% Instruct

Also, new visuals on the leaderboard!
Terry Yue Zhuo (@terryyuezhuo)'s Twitter Profile Photo

Llama-4 Maverick on BigCodeBench-Full

61.9% Complete
49.7% Instruct
55.8% Average

Both GPT-4o-2024-05-13 & DeepSeek V3 got 56.1% on average. There may be some gaps between Llama-4 Maverick and the recent (3-month) frontier models, given the fast pace in AI development these…

Brendan Dolan-Gavitt (@moyix)'s Twitter Profile Photo

I'm beginning to realize that even if you can find 100,000 vulns in a day with your AI superhacker, the bigger bottleneck by far is going to be figuring out how to report them all, follow up with vendors, etc.

Fan Zhou✈️ICLR2025 (@fazhou_998)'s Twitter Profile Photo

🥁🥁
Happy to share our latest efforts on math pre-training data, the MegaMath dataset! This is a 9-month project that started in the summer of 2024, and we finally deliver the largest math pre-training dataset to date, containing 💥370B💥 tokens of web, code, and synthetic data!
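
For readers who want to poke at the corpus, here is a minimal sketch of streaming one MegaMath subset with the Hugging Face datasets library; the hub id ("LLM360/MegaMath") and config name ("megamath-web") are assumptions for illustration, not details stated in the tweet:

```python
# Minimal sketch: stream one MegaMath subset without downloading all 370B tokens.
# NOTE: the repo id "LLM360/MegaMath" and config "megamath-web" are assumptions;
# check the dataset card on the Hub for the actual names.
from datasets import load_dataset

ds = load_dataset("LLM360/MegaMath", "megamath-web", split="train", streaming=True)
for example in ds:
    print(example)  # inspect a single web-sourced math document
    break
```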
Terry Yue Zhuo (@terryyuezhuo)'s Twitter Profile Photo

DeepCoder-14B on BigCodeBench-Hard

Prefilling w/o Reasoning (Ranked 81st/195)
22.3% Complete
18.2% Instruct
20.3% on Average

No Prefilling, w/ Reasoning (Ranked 87th/195)
22.3% Complete
16.9% Instruct
19.6% on Average

o1 (reasoning=high) & o3 (reasoning=medium) -- 35.5% on…
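
For context, "prefilling w/o reasoning" here presumably means starting the assistant turn with an empty think block so the model answers directly instead of emitting a reasoning trace first. A minimal sketch of that trick with transformers; the model id and the <think> tag convention are assumptions based on DeepSeek-R1-style distills, not details confirmed by the tweet:

```python
# Minimal sketch: prefill an empty reasoning block so generation skips the trace.
# NOTE: the model id and the <think> tag convention are assumptions; adjust
# both for the actual checkpoint and its chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "agentica-org/DeepCoder-14B-Preview"  # assumed HF id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that merges two sorted lists."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "<think>\n\n</think>\n\n"  # the prefill: an empty reasoning block

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.6)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```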

Terry Yue Zhuo (@terryyuezhuo)'s Twitter Profile Photo

Optimus Alpha on BigCodeBench-Hard
(Ranked 12th/196)

35.1% Complete
30.4% Instruct
32.8% Average

Compared to Optimus Alpha, Quasar Alpha scores 34.8% Average, ranked 6th.

More results: bigcode-bench.github.io
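
Reading across these posts, the "Average" figure appears to be the plain mean of the Complete and Instruct pass rates; a quick sanity check against the numbers quoted above:

```python
# Quick check: the "Average" column is the simple mean of Complete and Instruct
# (figures taken from the tweets above).
def average(complete: float, instruct: float) -> float:
    return round((complete + instruct) / 2, 1)

print(average(35.1, 30.4))  # Optimus Alpha on BigCodeBench-Hard -> 32.8
print(average(61.9, 49.7))  # Llama-4 Maverick on BigCodeBench-Full -> 55.8
```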
Terry Yue Zhuo (@terryyuezhuo)'s Twitter Profile Photo

Using the recommended setup of the original DeepCoder (64k output tokens, 0.6 temperature), DeepCoder ranked 78th/196 on BigCodeBench-Hard

23% Complete 
18.2% Instruct
20.6% Average

DeepCoder with prefilling (no reasoning traces) got 20.3% Average.

cc Grad (@Grad62304977) Wenhu Chen (@WenhuChen)
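
For reference, a minimal sketch of that recommended decoding setup (0.6 temperature, 64k output tokens) sent to a local OpenAI-compatible server such as vLLM; the base_url, api_key placeholder, and model id are assumptions:

```python
# Minimal sketch: query DeepCoder with the decoding settings the tweet cites
# as recommended (temperature 0.6, 64k output tokens).
# NOTE: base_url and model id are assumptions for a local vLLM-style server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="agentica-org/DeepCoder-14B-Preview",  # assumed HF id
    messages=[{"role": "user", "content": "Write a Python function that ..."}],
    temperature=0.6,
    max_tokens=64 * 1024,  # leave room for long reasoning traces
)
print(resp.choices[0].message.content)
```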
Fan Zhou✈️ICLR2025 (@fazhou_998)'s Twitter Profile Photo

If you're looking to boost your model’s math reasoning ability, don’t miss out on MegaMath —— The Largest Open Math Pre-training in the world!

🧠 Need from-scratch training? Use MegaMath. 
🔁 Need continual pre-training? Use MegaMath. 
🧬 Need high-quality mid-training? Use…
Zijian Wang @ ICLR 🇸🇬 (@zijianwang30)'s Twitter Profile Photo

Excited to be at #ICLR2025 hosting the Deep Learning for Code (#DL4C/@dl4code) workshop again! We've lined up an exciting program with diverse speakers in the AI for code space. Hope to see you there! DM me if you want to chat about code LLMs and coding agents!

Daytona.io (@daytonaio)'s Twitter Profile Photo

Introducing Daytona Cloud: the first agent-native cloud infrastructure. Fast. Stateful. Built for agents, not humans. Available today.

Fan Zhou✈️ICLR2025 (@fazhou_998)'s Twitter Profile Photo

⚠️ Personal Thought

Recently feeling bullish about agentic AI. Seeing SWE-bench scores surpass 50, even 60, is truly exciting.

🧐 But looking closer: if we're all using agentless frameworks instead of building environments where LMs can truly explore, can we really say the era of…