John Yang (@jyangballin) 's Twitter Profile
John Yang

@jyangballin

🌲 CS PhD @Stanford
🤖 SWE-bench + agent
🎓 Prev. @princeton_nlp 🐯, @Berkeley_EECS 🐻

ID: 616998786

Link: https://john-b-yang.github.io/
Joined: 24-06-2012 08:31:40

319 Tweets

3.3K Followers

675 Following

Mike A. Merrill (@mike_a_merrill) 's Twitter Profile Photo

Many agents (Claude Code, Codex CLI) interact with the terminal to do valuable tasks, but do they currently work well enough to deploy en masse? 

We’re excited to introduce Terminal-Bench: An evaluation environment and benchmark for AI agents on real-world terminal tasks. Tl;dr
Kilian Lieret @ICLR (@klieret) 's Twitter Profile Photo

Massive gains with Sonnet 4 on SWE-agent: Single-attempt pass@1 rises to 69% on SWE-bench Verified! Sonnet 4 iterates longer (making it slightly more expensive) but almost never gets stuck. Localization ability appears unchanged, but quality of edits improves.
Alex Zhang (@a1zhang) 's Twitter Profile Photo

Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II? 𝗩𝗶𝗱𝗲𝗼𝗚𝗮𝗺𝗲𝗕𝗲𝗻𝗰𝗵 evaluates VLMs on Game Boy & MS-DOS games given only raw screen input, just like how a human would play. The best model (Gemini) completes just 0.48% of the benchmark! 🧵👇

Qinan Yu (@qinan_yu) 's Twitter Profile Photo

🎀 fine-grained, interpretable representation steering for LMs!
meet RePS — Reference-free Preference Steering!

1⃣ outperforms existing methods on 2B-27B LMs, nearly matching prompting
2⃣ supports both steering and suppression (beat system prompts!)
3⃣ jailbreak-proof

(1/n)
CLS (@chengleisi) 's Twitter Profile Photo

This year, there have been various pieces of evidence that AI agents are starting to be able to conduct scientific research and produce papers end-to-end, at a level where some of these generated papers were already accepted by top-tier conferences/workshops. Intology’s

John Yang (@jyangballin) 's Twitter Profile Photo

To find "good" GitHub repositories (good = well structured, lots of activity) for some language, I just use GitHub search (e.g. `language:go`), click "repositories", then sort search results by "Most stars". Feels kind of primitive, are there better ways to do this?

To find "good" GitHub repositories (good = well structured, lots of activity) for some language, I just use GitHub search (e.g. `language:go`), click "repositories", then sort search results by "Most stars".

Feels kind of primitive, are there better ways to do this?
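A programmatic version of the same search, sketched in Python with the `requests` library against GitHub's public repository search API. The endpoint, the `language:`, `stars:`, and `pushed:` qualifiers, and the `sort=stars` parameter are standard GitHub search features; the specific thresholds here (1,000+ stars, pushed since 2024) are arbitrary assumptions used as a rough proxy for "well structured, lots of activity":

```python
import requests

# Same query the web UI runs: Go repos, many stars, recently active.
# Thresholds are illustrative assumptions, not a recommendation.
QUERY = "language:go stars:>1000 pushed:>2024-01-01"

resp = requests.get(
    "https://api.github.com/search/repositories",
    params={"q": QUERY, "sort": "stars", "order": "desc", "per_page": 10},
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
resp.raise_for_status()

# Print the top hits with their star counts and last push date.
for repo in resp.json()["items"]:
    print(f'{repo["full_name"]}: {repo["stargazers_count"]} stars, last push {repo["pushed_at"]}')
```

Unauthenticated requests to the search API are rate-limited, so for anything beyond a quick look you would add a personal access token in the `Authorization` header.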
Ludwig Schmidt (@lschmidt3) 's Twitter Profile Photo

Very excited to finally release our paper for OpenThoughts!

After DataComp and DCLM, this is the third large open dataset my group has been building in collaboration with the DataComp community. This time, the focus is on post-training, specifically reasoning data.
Omar Shaikh (@oshaikh13) 's Twitter Profile Photo

What if LLMs could learn your habits and preferences well enough (across any context!) to anticipate your needs? In a new paper, we present the General User Model (GUM): a model of you built from just your everyday computer use. 🧵

Ben Shi (@benshi34) 's Twitter Profile Photo

As we optimize model reasoning over verifiable objectives, how does this affect human understanding of said reasoning to achieve superior collaborative outcomes?

In our new preprint, we investigate human-centric model reasoning for knowledge transfer 🧵:
Kol Tregaskes (@koltregaskes) 's Twitter Profile Photo

The top SWE agent is not Cursor or Windsurf, it's two tools that can be downloaded from GitHub: OpenHands (All Hands AI) and SWE-Agent.

Btw SWE-Agent does have an X handle, but it looks fake or hacked. Check the link below to the LiveSWEBench benchmark and the links to the real agents.
Yijia Shao (@echoshao8899) 's Twitter Profile Photo

🚨 70 million US workers are about to face their biggest workplace transformation due to AI agents. But nobody asks them what they want.

While AI races to automate everything, we took a different approach: auditing what workers want vs. what AI can do across the US workforce.🧵
Yutong Zhang (@zhangyt0704) 's Twitter Profile Photo

AI companions aren’t science fiction anymore 🤖💬❤️
Thousands are turning to AI chatbots for emotional connection – finding comfort, sharing secrets, and even falling in love. But as AI companionship grows, the line between real and artificial relationships blurs.

📰 “Can A.I.
John Yang (@jyangballin) 's Twitter Profile Photo

If you wanna stay up to date with SWE-bench leaderboard updates, follow our new Twitter account! And if you're bored of SWE-bench Verified, check out SWE-bench Multimodal; +25% progress over the last 9 months.

David Hall (@dlwh) 's Twitter Profile Photo

So about a month ago, Percy posted a version of this plot of our Marin 32B pretraining run. We got a lot of feedback, both public and private, that the spikes were bad. (This is a thread about how we fixed the spikes. Bear with me.)
CLS (@chengleisi) 's Twitter Profile Photo

Are AI scientists already better than human researchers?

We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts.

Main finding: LLM ideas result in worse projects than human ideas.
Diyi Yang (@diyi_yang) 's Twitter Profile Photo

Our study led by CLS reveals an “ideation–execution gap” 😲 Ideas from LLMs may sound novel, but when experts spend 100+ hrs executing them, they flop: 💥 👉 human‑generated ideas outperform on novelty, excitement, effectiveness & overall quality!

Ori Press (@ori_press) 's Twitter Profile Photo

Do language models have algorithmic creativity?

To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️
Talor Abramovich (@abramovichtalor) 's Twitter Profile Photo

Join me next week at #ICML25, where I will be presenting my first first-author paper –– EnIGMA. EnIGMA, an LM agent for cybersecurity, uses interactive tools for server connection and debugging, achieving state-of-the-art on 3 CTF benchmarks. youtube.com/watch?v=50zkWJ…

SWE-bench (@swebench) 's Twitter Profile Photo

SWE-agent is now Multimodal! 😎
We're releasing SWE-agent Multimodal, with image-viewing abilities and a full web browser for debugging front-ends. Evaluate your LMs on SWE-bench Multimodal or use it yourself for front-end dev. 
🔗➡️
Keyon Vafa (@keyonv) 's Twitter Profile Photo

Can an AI model predict perfectly and still have a terrible world model? What would that even mean? Our new ICML paper formalizes these questions. One result tells the story: a transformer trained on 10M solar systems nails planetary orbits. But it botches gravitational laws 🧵