Sida Wang (@sidawxyz)'s Twitter Profile
Sida Wang

@sidawxyz

ID: 20573627

Joined: 11-02-2009 05:43:03

25 Tweets

475 Followers

305 Following

Daniel Fried (@dan_fried)

I’m excited to release a paper (and model weights!) for InCoder: a generative code model that can infill as well as do left-to-right generation. Project page: sites.google.com/view/incoder-c… Demo: huggingface.co/spaces/faceboo… Paper: github.com/dpfried/incode… Thread (1/n):
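For readers who want to try it, here is a minimal sketch of InCoder-style infilling. The Hugging Face model id facebook/incoder-1B and the paper's <|mask:k|> sentinel convention are assumptions on my part; check the project page for the exact format.

```python
# Sketch only: assumes the HF model id "facebook/incoder-1B" and the
# paper's <|mask:k|> sentinel convention for marking the span to infill.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B")
model = AutoModelForCausalLM.from_pretrained("facebook/incoder-1B")

# Replace the missing span with a sentinel, then repeat the sentinel at
# the end to ask the model to generate the infill for that span.
prefix = "def count_words(path):\n    "
suffix = "\n    return counts"
prompt = prefix + "<|mask:0|>" + suffix + "<|mask:0|>"

inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(out[0]))
```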

Freda Shi (@fredahshi)

Late post but let’s do this! Happy to share our #EMNLP2022 work on translating natural language to executable code with execution-aware minimum Bayes risk decoding
📝Paper: arxiv.org/pdf/2204.11454…
📇Code: github.com/facebookresear…
📦Data (codex output): dl.fbaipublicfiles.com/mbr-exec/mbr-e…
(1/n)
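A hedged sketch of the core idea, execution-based MBR: sample many candidate programs, execute them on shared inputs, and keep the candidate whose execution results agree with the most other samples. sample_programs and run below are hypothetical helpers standing in for the model API and a sandboxed executor.

```python
from collections import Counter

def mbr_exec(instruction, test_inputs, sample_programs, run, k=25):
    """Pick the sampled program whose execution results agree with the
    most other samples (execution-result match as the MBR utility)."""
    candidates = sample_programs(instruction, n=k)
    # Execution signature: tuple of outputs on the shared inputs
    # (outputs must be hashable for this simple version).
    sigs = [tuple(run(prog, x) for x in test_inputs) for prog in candidates]
    counts = Counter(sigs)
    best = max(range(len(candidates)), key=lambda i: counts[sigs[i]])
    return candidates[best]
```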
Tianyi Zhang (@tianyi_zh)

🧑‍💻Code review is an important practice in software development. We take this idea into code generation and propose Coder-Reviewer reranking. Our method reranks via the product of a coder model p(code|instruction) and a reviewer model p(instruction|code). arxiv.org/abs/2211.16490

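The reranking rule itself fits in a few lines; here is a minimal sketch, where coder_logprob and reviewer_logprob are hypothetical wrappers returning sequence log-probabilities from the two scoring directions.

```python
def coder_reviewer_rerank(instruction, candidates, coder_logprob, reviewer_logprob):
    """Rerank by log p(code | instruction) + log p(instruction | code),
    i.e. the log of the coder-reviewer product."""
    def score(code):
        return coder_logprob(code, instruction) + reviewer_logprob(instruction, code)
    return max(candidates, key=score)
```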
Ansong Ni (@ansongni)

Execution results are strong indicators of program correctness. But how can we improve LLMs for code generation with execution?

In our new paper, we propose LEVER, a simple method that learns to verify and rerank LLM-generated programs with their execution results.  🧵👇 (1/n)
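A rough sketch of LEVER-style reranking, simplified (the paper also aggregates probability across programs with identical execution results); lm_prob, verifier_prob, and execute are hypothetical stand-ins for the generator, the learned verifier, and an executor.

```python
def lever_rerank(instruction, candidates, lm_prob, verifier_prob, execute):
    """Rerank programs by p_LM(program | instruction) weighted by a
    learned verifier that also sees the program's execution result."""
    def score(program):
        result = execute(program)  # an answer, a table, or an error message
        return lm_prob(program, instruction) * verifier_prob(instruction, program, result)
    return max(candidates, key=score)
```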
Alex Gu @ iclr (@minimario1729)

📢Introducing CRUXEval, a benchmark to measure Python code execution!

🏠Homepage: crux-eval.github.io
📜Paper: crux-eval.github.io/paper/cruxeval…
🏆Leaderboard: crux-eval.github.io/leaderboard.ht…
🔎Sample Explorer: crux-eval.github.io/demo.html
📊HF Dataset: huggingface.co/datasets/cruxe…
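To make the task concrete, a hedged illustration of what a CRUXEval-style item asks (the function below is invented, not drawn from the dataset): given a short Python function and an input, predict the output; the inverted variant asks for an input that produces a given output.

```python
# Invented example, not from the dataset.
def f(s):
    return s.replace("a", "b").split()

# Output prediction: what does f("banana split") return?
assert f("banana split") == ["bbnbnb", "split"]
# The inverted direction asks for an input x with f(x) equal to a given output.
```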
Naman Jain @ ICLR (@stringchaos)

📢📢Excited to introduce our new work LiveCodeBench!

📈 Live evaluations to ensure fairness and reliability
🔍 Holistic evaluations using 4 code-related scenarios
💡Insights from comparing 20+ code models

🚨🚨We use problem release dates to detect and prevent contamination
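A minimal sketch of the release-date contamination check described above; the field names are illustrative, not the benchmark's actual schema.

```python
from datetime import date

def uncontaminated(problems, model_cutoff):
    """Keep only problems released after the model's training cutoff."""
    return [p for p in problems if p["release_date"] > model_cutoff]

# e.g. a model trained through 2023-08-31 is scored only on newer problems
fresh = uncontaminated([{"id": 1, "release_date": date(2024, 1, 5)}],
                       model_cutoff=date(2023, 8, 31))
```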
Jonas Gehring (@jnsgehring)

LLMs for code should do much better if they can iterate on tests -- but they don't. Our new work (RLEF) addresses this: execution feedback at RL *training time* teaches the model to use execution feedback at *inference time*. arxiv.org/abs/2410.02089 is just out! 1/6

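A hedged sketch of the setup this implies: let the model iterate on public-test feedback at inference time, and reward final success on held-out tests at RL training time. model, problem, and run_tests are hypothetical stand-ins, not the paper's API.

```python
def rlef_episode(model, problem, run_tests, max_turns=3):
    """Let the model iterate on public-test feedback, then compute a
    sparse training-time reward from held-out tests."""
    feedback = ""
    code = ""
    for _ in range(max_turns):
        code = model.generate(problem.prompt + feedback)
        failures = run_tests(code, problem.public_tests)
        if not failures:
            break
        feedback = f"\n# Failing tests: {failures}\n"  # inference-time feedback
    # Reward 1.0 only if the final program passes the held-out tests.
    return 1.0 if not run_tests(code, problem.private_tests) else 0.0
```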
John Yang (@jyangballin)

We're launching SWE-bench Multimodal to eval agents' ability to solve visual GitHub issues.
- 617 *brand new* tasks from 17 JavaScript repos
- Each task has an image!

Existing agents struggle here! We present SWE-agent Multimodal to remedy some issues.
Led w/ carlos (@_carlosejimenez)
🧵
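If the benchmark is published on the Hugging Face Hub, loading it might look like the sketch below; the dataset id, split name, and field names are assumptions, not confirmed by the thread.

```python
from datasets import load_dataset

# Dataset id and field names are assumptions, not confirmed by the thread.
swebench_mm = load_dataset("princeton-nlp/SWE-bench_Multimodal", split="test")
for task in swebench_mm.select(range(3)):
    # Each task pairs a JavaScript repo issue with at least one image.
    print(task["instance_id"], task.get("image_assets"))
```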
Gabriel Synnaeve (@syhw)

SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution (arxiv.org/abs/2502.18449), by Yuxiang Wei, Sida Wang, and the whole team! Get started with your favorite model here: github.com/facebookresear…
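The paper describes a rule-based reward derived from the similarity between the model's predicted patch and the ground-truth patch; a minimal sketch of that idea using Python's difflib (the penalty value and helper signature are illustrative, not the paper's exact recipe).

```python
import difflib

def swe_rl_reward(predicted_patch, oracle_patch, parseable=True):
    """Similarity between predicted and ground-truth patch; penalize
    output that cannot be parsed into a patch at all."""
    if not parseable:
        return -1.0  # penalty value is illustrative
    return difflib.SequenceMatcher(None, predicted_patch, oracle_patch).ratio()
```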