Ofir Press (@ofirpress)'s Twitter Profile
Ofir Press

@ofirpress

I build tough benchmarks for LMs and then I get the LMs to solve them. Postdoc @Princeton. PhD from @nlpnoah @UW. Ex-visiting researcher @MetaAI & @MosaicML.

ID: 746788615951355904

Link: https://ofir.io/about

Joined: 25-06-2016 19:34:15

2.2K Tweets

12.12K Followers

4.4K Following

Ofir Press (@ofirpress)

To train a local SWE-agent you need to synthetically generate SWE-bench-like problems+solutions. SWE-smith is our scalable solution, and we got amazing results: SWE-agent-LM 32B is an extremely strong local agent. The tutorial vids are great! John Yang youtube.com/watch?v=HO8i0T…

Ben Shi (@benshi34)

As we optimize model reasoning over verifiable objectives, how does this affect human understanding of said reasoning to achieve superior collaborative outcomes? In our new preprint, we investigate human-centric model reasoning for knowledge transfer 🧵:

Ofir Press (@ofirpress)

Great to see super quick and lightweight models with good SWE-bench numbers! Up until ~6 months ago these were the numbers that the frontier models were getting.

Ori Press (@ori_press)

Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️

Brandon Amos (@brandondamos)

Excited to release AlgoTune!! It's a benchmark and coding agent for optimizing the runtime of numerical code 🚀 algotune.io 📚 algotune.io/paper.pdf 🤖 github.com/oripress/AlgoT… with Ofir Press Ori Press Patrick Kidger Bartolomeo Stellato Arman Zharmagambetov & many others 🧵

Ofir Press (@ofirpress)

I'm trying to edit the SWE-bench website's CSS and top models are really struggling. People, please evaluate on SWE-bench Multimodal more so I don't have to spend time on our CSS. I'm begging you, this is soooo boring and frustrating. Also, what do you think about the redesign?
