Minyang Tian (@minyangtian1)'s Twitter Profile
Minyang Tian

@minyangtian1

PhD candidate at UIUC, co-advised by @haopeng_nlp and Eliu Huerta @argonne and @UChicago

ID: 1813179338654949379

Joined: 16-07-2024 11:50:26

19 Tweets

129 Followers

115 Following

Ofir Press (@ofirpress):

Join us on August 14th at 3PM Eastern / 12PM Pacific to learn about the three new benchmarks we've recently released: SciCode, AssistantBench and CiteME. We will also have some SWE-bench updates. The event will be on Zoom. lu.ma/4240w5us

Ofir Press (@ofirpress):

Announcing Ofir's Gelato Challenge: At NeurIPS 2024, I will buy gelato for the team that has the highest combined score on SWE-bench Lite, AssistantBench, CiteME, and SciCode. Final submission is by December 3, 2024.

Ofir Press (@ofirpress):

SciCode is our new benchmark, with very tough programming challenges written by real scientists. scicode-bench.github.io for more details.

Akari Asai (@akariasai):

1/ Introducing ᴏᴘᴇɴꜱᴄʜᴏʟᴀʀ: a retrieval-augmented LM to help scientists synthesize knowledge 📚 UW NLP Ai2 With open models & 45M-paper datastores, it outperforms proprietary systems & matches human experts. Try out our demo! We also introduce ꜱᴄʜᴏʟᴀʀQᴀʙᴇɴᴄʜ,

Ofir Press (@ofirpress):

Thanks everyone for coming to our poster yesterday! Lots of SWE-agent news coming soon. In 30 mins, with Minyang Tian et al we'll present SciCode, a super tough scientific coding benchmark that o1 gets 7% on. West Ballroom A-D #5204. Come through :)

Ofir Press (@ofirpress):

SciCode is our super tough coding benchmark testing the abilities of LMs to program code based on research in physics/biology/material science/... o1 is the SoTA with 7%. To make it easier to use we're putting it into the Inspect AI format, as a few groups were asking for this.

Ofir Press (@ofirpress):

Congrats to o3-mini on setting a new high score on SciCode!! R1 clocks in at an impressive 4.6%, matching Claude 3.5. SciCode is our super-tough programming benchmark written by PhDs in various scientific domains.

Ofir Press (@ofirpress):

Proud to see companies starting to use our SciCode to eval LMs. SciCode has some questions taken from Nobel-winning research in physics so it's super exciting to get more people to work on improving these abilities. scicode-bench.github.io

Shivam Agarwal (@shivamag12):

Can entropy minimization alone improve LLM performance? And how far can they go without any labeled data? This work answers both: yes, and surprisingly far 🐮 At inference, EM can beat GPT-4o, Claude 3 Opus & Gemini 1.5 Pro on challenging scientific coding w/o any data/model update
