Chanwoo Park (@chanwoopark20)'s Twitter Profile
Chanwoo Park

@chanwoopark20

Games, Multi-agent (gen) AI | @speedrun SR003 | @mit EECS Ph.D. Candidate

ID: 1457347791723069440

Link: https://chanwoo-park-official.github.io/ | Joined: 07-11-2021 14:04:11

702 Tweets

1.1K Followers

1.1K Following

Chanwoo Park (@chanwoopark20):

The LMSYS Chat dataset hasn’t been updated in over a year, yet recent models consistently top the leaderboard early on. Are rankings being manipulated? Analyzing model selection trends over time could reveal the right metric for fair LLM evaluation... Wanna check the recent

Chanwoo Park (@chanwoopark20):

I even believe that RL (or regret training) is needed before SFT, or even before the pretraining phase. If you've worked on RL with LLMs, you'll know what I mean. Super nice to read this paper!

Chanwoo Park (@chanwoopark20):

One of the biggest cultural differences between East Asians and Americans is the concept of "face" (面子, 체면)—a nuanced idea that encompasses social reputation, honor, and maintaining harmony in interpersonal relationships, with no direct English equivalent. This distinction

Vishal Pandey (@its_vayishu):

I interviewed for an ML research internship at Meta (FAIR) a few years back. Don’t remember every detail now, but a few questions stuck with me. Questions are below.

Chanwoo Park (@chanwoopark20):

My mother uses ChatGPT and trades ETH, no surprise there! I really hope South Korea can emerge as a global leader across key tech domains like robotics, AI, crypto, etc. On that note, Korea also offers strong government-backed investment support for startups, which is a great

Zae Myung Kim (@zaemyung):

🚨 New Paper Alert! 🚨

How can we align language models without drowning in prompt engineering or falling into reward hacking traps?

We introduce Meta Policy Optimization (MPO), a new reinforcement learning framework that evolves its own reward model rubrics through meta-level
Chanwoo Park (@chanwoopark20):

That is the reason you need an evolving reward function. huggingface.co/papers/2504.20… Check out this paper: it provides some answers about curriculum learning, evolving rewards, and reward hacking using evaluative thinking. Zae Myung Kim

Stella Li (@stellalisy):

🤯 We cracked RLVR with... Random Rewards?!
Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵
Blogpost: tinyurl.com/spurious-rewar…
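The three reward schemes compared in the thread can be sketched as simple functions. This is a minimal illustration, not the paper's code; the function names and the exact-match check are assumptions for the sake of the example.

```python
import random

def ground_truth_reward(answer: str, gold: str) -> float:
    """Standard RLVR: reward 1 only when the answer matches the gold label."""
    return 1.0 if answer.strip() == gold.strip() else 0.0

def incorrect_reward(answer: str, gold: str) -> float:
    """Spurious variant: reward 1 only when the answer is WRONG."""
    return 1.0 - ground_truth_reward(answer, gold)

def random_reward(answer: str, gold: str, rng: random.Random) -> float:
    """Spurious variant: a coin flip, independent of correctness."""
    return float(rng.random() < 0.5)
```

The surprising claim is that even the random and incorrect variants, plugged into an RLVR training loop, improved MATH-500 accuracy nearly as much as the ground-truth reward.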