Ofir Press (@ofirpress) Twitter Tweets • TwiCopy

Ofir Press

@ofirpress

I build tough benchmarks for LMs and then I get the LMs to solve them. Postdoc @Princeton. PhD from @nlpnoah @UW. Ex-visiting researcher @MetaAI & @MosaicML.

ID: 746788615951355904

Link: https://ofir.io/about

Joined: 25-06-2016 19:34:15

2,2K Tweet

12,12K Followers

4,4K Following

Ofir Press

@ofirpress

5 months ago

To train a local SWE-agent you need to synthetically generate SWE-bench-like problems+solutions. SWE-smith is our scalable solution, and we got amazing results: SWE-agent-LM 32B is an extremely strong local agent. The tutorial vids are great! John Yang youtube.com/watch?v=HO8i0T…

Likes: 17 | Replies: 1 | Retweets: 2

Ofir Press

@ofirpress

5 months ago

So cool!!

Likes: 7 | Replies: 1 | Retweets: 0

Alex Zhang

@a1zhang

5 months ago

Gemini 2.5 Flash plays Final Fantasy in real-time (VideoGameBench)

Likes: 17 | Replies: 3 | Retweets: 2

Ben Shi

@benshi34

5 months ago

As we optimize model reasoning over verifiable objectives, how does this affect human understanding of said reasoning to achieve superior collaborative outcomes? In our new preprint, we investigate human-centric model reasoning for knowledge transfer 🧵:

Likes: 177 | Replies: 6 | Retweets: 39

Ofir Press

@ofirpress

5 months ago

Congrats Dr. Merrill!

Likes: 11 | Replies: 0 | Retweets: 0

Ofir Press

@ofirpress

5 months ago

Great to see super quick and lightweight models with good SWE-bench numbers! Up until ~6 months ago these were the numbers that the frontier models were getting.

Likes: 19 | Replies: 0 | Retweets: 2

SWE-bench

@swebench

5 months ago

We just updated the SWE-bench Multimodal leaderboard with new systems from Refact.ai, All Hands AI and TU München. Congrats to all teams on pushing the state-of-the-art performance! SWE-bench Multimodal challenges AI systems to fix issues that are described using screenshots.

Likes: 18 | Replies: 1 | Retweets: 5

Ofir Press

@ofirpress

4 months ago

NeurIPS reviews are due on Wednesday. Thanks John Yang for reminding me :)

Likes: 5 | Replies: 0 | Retweets: 0

Ori Press

@ori_press

4 months ago

Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️

Likes: 141 | Replies: 6 | Retweets: 54

Brandon Amos

@brandondamos

4 months ago

Excited to release AlgoTune!! It's a benchmark and coding agent for optimizing the runtime of numerical code 🚀 algotune.io 📚 algotune.io/paper.pdf 🤖 github.com/oripress/AlgoT… with Ofir Press, Ori Press, Patrick Kidger, Bartolomeo Stellato, Arman Zharmagambetov & many others 🧵

Likes: 129 | Replies: 2 | Retweets: 26

Ofir Press

@ofirpress

4 months ago

I'm trying to edit the SWE-bench website's CSS and top models are really struggling. People, please evaluate on SWE-bench Multimodal more so I don't have to spend time on our CSS. I'm begging you, this is soooo boring and frustrating. Also, what do you think about the redesign?

Likes: 13 | Replies: 2 | Retweets: 0

Ofir Press

@ofirpress

4 months ago

Just removing a lot of vertical space so you can see more entries without scrolling...

Likes: 0 | Replies: 0 | Retweets: 0