John Yang (@jyangballin) 's Twitter Profile
John Yang

@jyangballin

🌲 CS PhD @Stanford
🤖 SWE-bench + agent
🎓 Prev. @princeton_nlp 🐯, @Berkeley_EECS 🐻

ID: 616998786

Link: https://john-b-yang.github.io/
Joined: 24-06-2012 08:31:40

319 Tweets

3.3K Followers

675 Following

Mike A. Merrill (@mike_a_merrill) 's Twitter Profile Photo

Many agents (Claude Code, Codex CLI) interact with the terminal to do valuable tasks, but do they currently work well enough to deploy en masse? 

We’re excited to introduce Terminal-Bench: An evaluation environment and benchmark for AI agents on real-world terminal tasks. Tl;dr
Kilian Lieret @ICLR (@klieret) 's Twitter Profile Photo

Massive gains with Sonnet 4 on SWE-agent: Single-attempt pass@1 rises to 69% on SWE-bench Verified! Sonnet 4 iterates longer (making it slightly more expensive) but almost never gets stuck. Localization ability appears unchanged, but quality of edits improves.
Alex Zhang (@a1zhang) 's Twitter Profile Photo

Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II? 𝗩𝗶𝗱𝗲𝗼𝗚𝗮𝗺𝗲𝗕𝗲𝗻𝗰𝗵 evaluates VLMs on Game Boy & MS-DOS games given only raw screen input, just like how a human would play. The best model (Gemini) completes just 0.48% of the benchmark! 🧵👇

Qinan Yu (@qinan_yu) 's Twitter Profile Photo

🎀 fine-grained, interpretable representation steering for LMs!
meet RePS — Reference-free Preference Steering!

1⃣ outperforms existing methods on 2B-27B LMs, nearly matching prompting
2⃣ supports both steering and suppression (beat system prompts!)
3⃣ jailbreak-proof

(1/n)
CLS (@chengleisi) 's Twitter Profile Photo

This year, there have been various pieces of evidence that AI agents are starting to be able to conduct scientific research and produce papers end-to-end, at a level where some of these generated papers were already accepted by top-tier conferences/workshops. Intology’s

John Yang (@jyangballin) 's Twitter Profile Photo

To find "good" GitHub repositories (good = well structured, lots of activity) for some language, I just use GitHub search (e.g. `language:go`), click "repositories", then sort search results by "Most stars". Feels kind of primitive, are there better ways to do this?

To find "good" GitHub repositories (good = well structured, lots of activity) for some language, I just use GitHub search (e.g. `language:go`), click "repositories", then sort search results by "Most stars".

Feels kind of primitive, are there better ways to do this?
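A programmatic version of the same search, sketched in Python with the `requests` library against GitHub's public repository search API. The endpoint, the `language:`, `stars:`, and `pushed:` qualifiers, and the `sort=stars` parameter are standard GitHub search features; the specific thresholds here (1,000+ stars, pushed since 2024) are arbitrary assumptions used as a rough proxy for "well structured, lots of activity":

```python
import requests

# Same query the web UI runs: Go repos, many stars, recently active.
# Thresholds are illustrative assumptions, not a recommendation.
QUERY = "language:go stars:>1000 pushed:>2024-01-01"

resp = requests.get(
    "https://api.github.com/search/repositories",
    params={"q": QUERY, "sort": "stars", "order": "desc", "per_page": 10},
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
resp.raise_for_status()

# Print the top hits with their star counts and last push date.
for repo in resp.json()["items"]:
    print(f'{repo["full_name"]}: {repo["stargazers_count"]} stars, last push {repo["pushed_at"]}')
```

Unauthenticated requests to the search API are rate-limited, so for anything beyond a quick look you would add a personal access token in the `Authorization` header.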
Ludwig Schmidt (@lschmidt3) 's Twitter Profile Photo

Very excited to finally release our paper for OpenThoughts!

After DataComp and DCLM, this is the third large open dataset my group has been building in collaboration with the DataComp community. This time, the focus is on post-training, specifically reasoning data.
Omar Shaikh (@oshaikh13) 's Twitter Profile Photo

What if LLMs could learn your habits and preferences well enough (across any context!) to anticipate your needs? In a new paper, we present the General User Model (GUM): a model of you built from just your everyday computer use. 🧵

Ben Shi (@benshi34) 's Twitter Profile Photo

As we optimize model reasoning over verifiable objectives, how does this affect human understanding of said reasoning to achieve superior collaborative outcomes?

In our new preprint, we investigate human-centric model reasoning for knowledge transfer 🧵:
Kol Tregaskes (@koltregaskes) 's Twitter Profile Photo

The top SWE agent is not Cursor or Windsurf, it's two tools that can be downloaded from GitHub: OpenHands (All Hands AI) and SWE-Agent.

Btw SWE-Agent does have an X handle, but it looks fake or hacked. Check the link below to the LiveSWEBench benchmark and the links to the real agents.
Yijia Shao (@echoshao8899) 's Twitter Profile Photo

🚨 70 million US workers are about to face their biggest workplace transformation due to AI agents. But nobody asks them what they want.

While AI races to automate everything, we took a different approach: auditing what workers want vs. what AI can do across the US workforce.🧵
Yutong Zhang (@zhangyt0704) 's Twitter Profile Photo

AI companions aren’t science fiction anymore 🤖💬❤️
Thousands are turning to AI chatbots for emotional connection – finding comfort, sharing secrets, and even falling in love. But as AI companionship grows, the line between real and artificial relationships blurs.

📰 “Can A.I.
John Yang (@jyangballin) 's Twitter Profile Photo

If you wanna stay up to date with SWE-bench leaderboard updates, follow our new Twitter account! And if you're bored of SWE-bench Verified, check out SWE-bench Multimodal; +25% progress over the last 9 months.

David Hall (@dlwh) 's Twitter Profile Photo

So about a month ago, Percy posted a version of this plot of our Marin 32B pretraining run. We got a lot of feedback, both public and private, that the spikes were bad. (This is a thread about how we fixed the spikes. Bear with me.)
CLS (@chengleisi) 's Twitter Profile Photo

Are AI scientists already better than human researchers?

We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts.

Main finding: LLM ideas result in worse projects than human ideas.
Diyi Yang (@diyi_yang) 's Twitter Profile Photo

Our study led by CLS reveals an “ideation–execution gap” 😲 Ideas from LLMs may sound novel, but when experts spend 100+ hrs executing them, they flop: 💥 👉 human‑generated ideas outperform on novelty, excitement, effectiveness & overall quality!

Ori Press (@ori_press) 's Twitter Profile Photo

Do language models have algorithmic creativity?

To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️
Talor Abramovich (@abramovichtalor) 's Twitter Profile Photo

Join me next week at #ICML25, where I will be presenting my first first-author paper –– EnIGMA. EnIGMA, an LM agent for cybersecurity, uses interactive tools for server connection and debugging, achieving state-of-the-art on 3 CTF benchmarks. youtube.com/watch?v=50zkWJ…

SWE-bench (@swebench) 's Twitter Profile Photo

SWE-agent is now Multimodal! 😎
We're releasing SWE-agent Multimodal, with image-viewing abilities and a full web browser for debugging front-ends. Evaluate your LMs on SWE-bench Multimodal or use it yourself for front-end dev. 
🔗➡️
Keyon Vafa (@keyonv) 's Twitter Profile Photo

Can an AI model predict perfectly and still have a terrible world model? What would that even mean? Our new ICML paper formalizes these questions. One result tells the story: a transformer trained on 10M solar systems nails planetary orbits. But it botches gravitational laws 🧵