Jing Yu Koh (@kohjingyu)'s Twitter Profile
Jing Yu Koh

@kohjingyu

Computer control agents @AIatMeta, on leave from ML PhD @CarnegieMellon.

Prev: multimodal research @GoogleAI, undergrad @sutdsg. Opinions my own. 🇸🇬

ID: 52736437

Link: https://jykoh.com · Joined: 01-07-2009 14:16:32

1.1K Tweets

5.5K Followers

373 Following

Jing Yu Koh (@kohjingyu)

When I was 7 I also got stuck in Mt. Moon, and I asked a friend who had already completed the game to take over and help me out. This is probably not dissimilar from how computer use agents will eventually work.

Jing Yu Koh (@kohjingyu)

I appreciate model runs that are so terrible they're obviously bugged. The hardest experiments to debug are the ones that are only slightly bad.

Jing Yu Koh (@kohjingyu)

I'm super excited about these kinds of multimodal image+text -> image+text models. I think they will unlock a ton of interesting applications. Congrats on the launch!

Jason Baldridge (@jasonbaldridge)

Media theorists Nicolas Malevé and Katrina Sluis interviewed me about my work on image generation at the end of 2023, and the writeup of that interview is finally available! It's a somewhat historical overview with my personal take on various developments.

Yu Su @#ICLR2025 (@ysu_nlp)


🔥2025 is the year of agents, but are we there yet?🤔

🤯 "An Illusion of Progress? Assessing the Current State of Web Agents" –– our new study shows that frontier web agents may be far less competent (up to 59%) than previously reported!

Why were benchmark numbers inflated?
Jacob Springer (@jacspringer)


Training with more data = better LLMs, right? 🚨

False! Scaling language models by adding more pre-training data can decrease your performance after post-training!

Introducing "catastrophic overtraining." 🥁🧵+arXiv 👇

1/9
Russ Salakhutdinov (@rsalakhu)

Llama 4 models are out! Open sourced! Check them out: “Native multimodality, mixture-of-experts models, super long context windows, step changes in performance, and unparalleled efficiency. All in easy-to-deploy sizes custom fit for how you want to use it” llama.com

Graham Neubig (@gneubig)

A big two days of agents starting tomorrow at CMU (and then two days of agent hackathon after that!). Registration is still open, so if you're in or around Pittsburgh, come one, come all: cmu-agent-workshop.github.io. We also plan to livestream for participants who can't make it in person.

Xing Han Lu (@xhluca)


AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories  

We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories.

We find that rule-based evals underreport success rates, and
Christina Baek (@_christinabaek)


Are current reasoning models optimal for test-time scaling? 🌠
No! Models make the same incorrect guess over and over again.

We show that you can fix this problem w/o any crazy tricks 💫 – just do weight ensembling (WiSE-FT) for big gains on math!

1/N
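The "weight ensembling (WiSE-FT)" mentioned above amounts to linearly interpolating the parameters of two checkpoints (e.g., a base model and a fine-tuned one). A minimal sketch, assuming a toy dict-of-scalars weight representation; the function name, the example parameter names, and the representation are illustrative, not from the thread:

```python
# WiSE-FT-style weight ensembling: interpolate two sets of model weights.
# Real checkpoints would hold tensors; plain floats keep the sketch self-contained.

def wise_ft_merge(base_weights, finetuned_weights, alpha=0.5):
    """Return weights computed as (1 - alpha) * base + alpha * finetuned.

    alpha=0.0 recovers the base model, alpha=1.0 the fine-tuned model;
    intermediate values trade off between the two.
    """
    assert base_weights.keys() == finetuned_weights.keys()
    return {
        name: (1.0 - alpha) * base_weights[name] + alpha * finetuned_weights[name]
        for name in base_weights
    }

# Toy example with scalar "parameters":
base = {"layer.w": 1.0, "layer.b": 0.0}
tuned = {"layer.w": 3.0, "layer.b": 2.0}
merged = wise_ft_merge(base, tuned, alpha=0.5)
print(merged)  # {'layer.w': 2.0, 'layer.b': 1.0}
```

The same elementwise interpolation applies per-tensor when the weights are arrays rather than scalars.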
Jing Yu Koh (@kohjingyu)

Major FOMO about missing ICLR this year. But I hope you all enjoy Singapore, and push for more AI conferences to be held there ;)

John Yang (@jyangballin)


40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified.

We built it by synthesizing a ton of agentic training data from 100+ Python repos.

Today we’re open-sourcing the toolkit that made it happen: SWE-smith.
Jing Yu Koh (@kohjingyu)

If you’re serious about building a good model you should do all of your important evals in-house. Implementing and running evals is wildly informative about what data/techniques you should be looking into. This becomes especially true the more non-standard your eval is (e.g.,

Jing Yu Koh (@kohjingyu)

Huh, and I thought computer use agents were easily scooped. Must be stressful to be a PhD student working on reasoning right now.

Keshigeyan Chandrasegaran (@keshigeyan)

1/ Model architectures have been mostly treated as fixed post-training.

🌱 Introducing Grafting: a new way to edit pretrained diffusion transformers, allowing us to customize architectural designs on a small compute budget.

🌎 grafting.stanford.edu

Co-led with Michael Poli

Jing Yu Koh (@kohjingyu)

Every time an “LLMs don’t do X” paper pops off or wins an award, I think about the guys who won a Nobel Prize in economics for their paper that proved the used car market cannot exist

Jing Yu Koh (@kohjingyu)

Every researcher has a favorite paper of theirs, and it’s almost never their most cited one. Find it, compliment it, and you shall earn the way to their heart.