Terry Yue Zhuo (@terryyuezhuo)'s Twitter Profile
Terry Yue Zhuo

@terryyuezhuo

@BigComProject (non-profit, looking for compute sponsorship) Lead. IBM PhD fellow (2024-2025). Vibe Coding: swe-arena.com

ID: 1266233448521297926

Link: https://terryyz.github.io · Joined: 29-05-2020 05:02:42

1.1K Tweets

1.1K Followers

621 Following

meg.ai 🇨🇦 (@meganrisdal)'s Twitter Profile Photo

The NeurIPS Conference 2025 Datasets & Benchmarks track CFP is open! Good datasets are essential to progress in AI, so I'm thrilled to be a co-chair this year and, with my fellow colleagues, share some improvements to the submission and review process that raise the bar for standards of…

Xeophon (@thexeophon)'s Twitter Profile Photo

Super excited to share what I've been working on 👀

You know the struggle: You want to use a project, but it has a GPL or CC-by-NC license 😭

We worked hard and our AI-based agents convert any repo to an MIT-licensed one 🚀🚀🚀

Comment "Want" to get early access 👇🏼
Terry Yue Zhuo (@terryyuezhuo)'s Twitter Profile Photo

Llama-4 Series on BigCodeBench-Hard
*Inference via NVIDIA NIM

Llama-4 Maverick Ranked 41st/192
Similar to Gemini-2.0-Flash-Thinking & GPT-4o-2024-05-13
29.1% Complete
25% Instruct

Llama-4-Scout Ranked 97th/192
16.9% Complete
16.9% Instruct

Also, new visuals on the leaderboard!
Terry Yue Zhuo (@terryyuezhuo)'s Twitter Profile Photo

Llama-4 Maverick on BigCodeBench-Full

61.9% Complete
49.7% Instruct
55.8% Average

Both GPT-4o-2024-05-13 & DeepSeek V3 got 56.1% on average. There may be some gaps between Llama-4 Maverick and the recent (3-month) frontier models, given the fast pace in AI development these…

Brendan Dolan-Gavitt (@moyix)'s Twitter Profile Photo

I'm beginning to realize that even if you can find 100,000 vulns in a day with your AI superhacker, the bigger bottleneck by far is going to be figuring out how to report them all, follow up with vendors, etc.

Fan Zhou✈️ICLR2025 (@fazhou_998)'s Twitter Profile Photo

🥁🥁
Happy to share our latest efforts on math pre-training data, the MegaMath dataset! This is a 9-month project that started in the summer of 2024, and we finally deliver the largest math pre-training dataset to date, containing 💥370B💥 tokens of web, code, and synthetic data!
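
For readers who want to poke at the corpus, here is a minimal sketch of streaming one MegaMath subset with the Hugging Face datasets library; the hub id ("LLM360/MegaMath") and config name ("megamath-web") are assumptions for illustration, not details stated in the tweet:

```python
# Minimal sketch: stream one MegaMath subset without downloading all 370B tokens.
# NOTE: the repo id "LLM360/MegaMath" and config "megamath-web" are assumptions;
# check the dataset card on the Hub for the actual names.
from datasets import load_dataset

ds = load_dataset("LLM360/MegaMath", "megamath-web", split="train", streaming=True)
for example in ds:
    print(example)  # inspect a single web-sourced math document
    break
```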
Terry Yue Zhuo (@terryyuezhuo)'s Twitter Profile Photo

DeepCoder-14B on BigCodeBench-Hard

Prefilling w/o Reasoning (Ranked 81st/195)
22.3% Complete
18.2% Instruct
20.3% on Average

No Prefilling, w/ Reasoning (Ranked 87th/195)
22.3% Complete
16.9% Instruct
19.6% on Average

o1 (reasoning=high) & o3 (reasoning=medium) -- 35.5% on…
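
For context, "prefilling w/o reasoning" here presumably means starting the assistant turn with an empty think block so the model answers directly instead of emitting a reasoning trace first. A minimal sketch of that trick with transformers; the model id and the <think> tag convention are assumptions based on DeepSeek-R1-style distills, not details confirmed by the tweet:

```python
# Minimal sketch: prefill an empty reasoning block so generation skips the trace.
# NOTE: the model id and the <think> tag convention are assumptions; adjust
# both for the actual checkpoint and its chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "agentica-org/DeepCoder-14B-Preview"  # assumed HF id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that merges two sorted lists."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "<think>\n\n</think>\n\n"  # the prefill: an empty reasoning block

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.6)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```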

Terry Yue Zhuo (@terryyuezhuo)'s Twitter Profile Photo

Optimus Alpha on BigCodeBench-Hard
(Ranked 12th/196)

35.1% Complete
30.4% Instruct
32.8% Average

Compared to Optimus Alpha, Quasar Alpha scores 34.8% Average, ranked 6th.

More results: bigcode-bench.github.io
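
Reading across these posts, the "Average" figure appears to be the plain mean of the Complete and Instruct pass rates; a quick sanity check against the numbers quoted above:

```python
# Quick check: the "Average" column is the simple mean of Complete and Instruct
# (figures taken from the tweets above).
def average(complete: float, instruct: float) -> float:
    return round((complete + instruct) / 2, 1)

print(average(35.1, 30.4))  # Optimus Alpha on BigCodeBench-Hard -> 32.8
print(average(61.9, 49.7))  # Llama-4 Maverick on BigCodeBench-Full -> 55.8
```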
Terry Yue Zhuo (@terryyuezhuo)'s Twitter Profile Photo

Using the recommended setup of the original DeepCoder (64k output tokens, 0.6 temperature), DeepCoder ranked 78th/196 on BigCodeBench-Hard

23% Complete 
18.2% Instruct
20.6% Average

DeepCoder with prefilling (no reasoning traces) got 20.3% Average.

cc Grad (@Grad62304977) Wenhu Chen (@WenhuChen)
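
For reference, a minimal sketch of that recommended decoding setup (0.6 temperature, 64k output tokens) sent to a local OpenAI-compatible server such as vLLM; the base_url, api_key placeholder, and model id are assumptions:

```python
# Minimal sketch: query DeepCoder with the decoding settings the tweet cites
# as recommended (temperature 0.6, 64k output tokens).
# NOTE: base_url and model id are assumptions for a local vLLM-style server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="agentica-org/DeepCoder-14B-Preview",  # assumed HF id
    messages=[{"role": "user", "content": "Write a Python function that ..."}],
    temperature=0.6,
    max_tokens=64 * 1024,  # leave room for long reasoning traces
)
print(resp.choices[0].message.content)
```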
Fan Zhou✈️ICLR2025 (@fazhou_998)'s Twitter Profile Photo

If you're looking to boost your model’s math reasoning ability, don’t miss out on MegaMath —— The Largest Open Math Pre-training in the world!

🧠 Need from-scratch training? Use MegaMath. 
🔁 Need continual pre-training? Use MegaMath. 
🧬 Need high-quality mid-training? Use…
Zijian Wang @ ICLR 🇸🇬 (@zijianwang30)'s Twitter Profile Photo

Excited to be at #ICLR2025 hosting the Deep Learning for Code (#DL4C/@dl4code) workshop again! We've lined up an exciting program with diverse speakers in the AI for code space. Hope to see you there! DM me if you want to chat about code LLMs and coding agents!

Daytona.io (@daytonaio)'s Twitter Profile Photo

Introducing Daytona Cloud: the first agent-native cloud infrastructure. Fast. Stateful. Built for agents, not humans. Available today.

Fan Zhou✈️ICLR2025 (@fazhou_998)'s Twitter Profile Photo

⚠️ Personal Thought

Recently feeling bullish about agentic AI. Seeing SWE-bench scores surpass 50, even 60, is truly exciting.

🧐 But looking closer: if we're all using agentless frameworks instead of building environments where LMs can truly explore, can we really say the era of…