Jing Yu Koh (@kohjingyu)'s Twitter Profile
Jing Yu Koh

@kohjingyu

Computer control agents @AIatMeta, on leave from ML PhD @CarnegieMellon.

Prev: multimodal research @GoogleAI, undergrad @sutdsg. Opinions my own. 🇸🇬

ID: 52736437

Link: https://jykoh.com · Joined: 01-07-2009 14:16:32

1.1K Tweets

5.5K Followers

373 Following

Jing Yu Koh (@kohjingyu)

When I was 7 I also got stuck in Mt. Moon, and I asked a friend who had already completed the game to take over and help me out. This is probably not dissimilar from how computer use agents will eventually work.

Jing Yu Koh (@kohjingyu)

I appreciate model runs that are so terrible they're obviously bugged. The hardest experiments to debug are the ones that are only slightly bad.

Jing Yu Koh (@kohjingyu)

I'm super excited about these kinds of multimodal image+text -> image+text models. I think they will unlock a ton of interesting applications. Congrats on the launch!

Jason Baldridge (@jasonbaldridge)

Media theorists Nicolas Malevé and Katrina Sluis interviewed me about my work on image generation at the end of 2023, and the writeup of that interview is finally available! It's a somewhat historical overview with my personal take on various developments.

Yu Su @#ICLR2025 (@ysu_nlp)


🔥2025 is the year of agents, but are we there yet?🤔

🤯 "An Illusion of Progress? Assessing the Current State of Web Agents" –– our new study shows that frontier web agents may be far less competent (up to 59%) than previously reported!

Why were benchmark numbers inflated?
Jacob Springer (@jacspringer)


Training with more data = better LLMs, right? 🚨

False! Scaling language models by adding more pre-training data can decrease your performance after post-training!

Introducing "catastrophic overtraining." 🥁🧵+arXiv 👇

1/9
Russ Salakhutdinov (@rsalakhu)

Llama 4 models are out! Open sourced! Check them out: “Native multimodality, mixture-of-experts models, super long context windows, step changes in performance, and unparalleled efficiency. All in easy-to-deploy sizes custom fit for how you want to use it” llama.com

Graham Neubig (@gneubig)

A big two days of agents starting tomorrow at CMU (and then two days of agent hackathon after that!). Registration is still open, so if you're in or around Pittsburgh, come one, come all: cmu-agent-workshop.github.io. We also plan to livestream for participants who can't make it in person.

Xing Han Lu (@xhluca)


AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories  

We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories.

We find that rule-based evals underreport success rates, and
Christina Baek (@_christinabaek)


Are current reasoning models optimal for test-time scaling? 🌠
No! Models make the same incorrect guess over and over again.

We show that you can fix this problem w/o any crazy tricks 💫 – just do weight ensembling (WiSE-FT) for big gains on math!

1/N
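The "weight ensembling (WiSE-FT)" mentioned above amounts to linearly interpolating the parameters of two checkpoints (e.g., a base model and a fine-tuned one). A minimal sketch, assuming a toy dict-of-scalars weight representation; the function name, the example parameter names, and the representation are illustrative, not from the thread:

```python
# WiSE-FT-style weight ensembling: interpolate two sets of model weights.
# Real checkpoints would hold tensors; plain floats keep the sketch self-contained.

def wise_ft_merge(base_weights, finetuned_weights, alpha=0.5):
    """Return weights computed as (1 - alpha) * base + alpha * finetuned.

    alpha=0.0 recovers the base model, alpha=1.0 the fine-tuned model;
    intermediate values trade off between the two.
    """
    assert base_weights.keys() == finetuned_weights.keys()
    return {
        name: (1.0 - alpha) * base_weights[name] + alpha * finetuned_weights[name]
        for name in base_weights
    }

# Toy example with scalar "parameters":
base = {"layer.w": 1.0, "layer.b": 0.0}
tuned = {"layer.w": 3.0, "layer.b": 2.0}
merged = wise_ft_merge(base, tuned, alpha=0.5)
print(merged)  # {'layer.w': 2.0, 'layer.b': 1.0}
```

The same elementwise interpolation applies per-tensor when the weights are arrays rather than scalars.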
Jing Yu Koh (@kohjingyu)

Major FOMO about missing ICLR this year. But I hope you all enjoy Singapore, and push for more AI conferences to be held there ;)

John Yang (@jyangballin)


40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified.

We built it by synthesizing a ton of agentic training data from 100+ Python repos.

Today we’re open-sourcing the toolkit that made it happen: SWE-smith.
Jing Yu Koh (@kohjingyu)

If you’re serious about building a good model you should do all of your important evals in-house. Implementing and running evals is wildly informative about what data/techniques you should be looking into. This becomes especially true the more non-standard your eval is (e.g.,

Jing Yu Koh (@kohjingyu)

Huh, and I thought computer use agents were easily scooped. Must be stressful to be a PhD student working on reasoning right now.

Keshigeyan Chandrasegaran (@keshigeyan)

1/ Model architectures have been mostly treated as fixed post-training.

🌱 Introducing Grafting: a new way to edit pretrained diffusion transformers, allowing us to customize architectural designs on a small compute budget.

🌎 grafting.stanford.edu

Co-led with Michael Poli

Jing Yu Koh (@kohjingyu)

Every time an “LLMs don’t do X” paper pops off or wins an award, I think about the guys who won a Nobel Prize in economics for their paper that proved the used car market cannot exist

Jing Yu Koh (@kohjingyu)

Every researcher has a favorite paper of theirs, and it’s almost never their most cited one. Find it, compliment it, and you shall earn the way to their heart.