Erfan Miahi (@erfan_mhi)'s Twitter Profile
Erfan Miahi

@erfan_mhi

Ex-researcher at @rlai_lab; collaborated with people from @googledeepmind; doing mostly RL reasoning!

Doing #parkour & #reading/#writing books in my spare time

ID: 843815099462828032

Link: https://www.linkedin.com/in/erfan-miahi-8637a1130/ | Joined: 20-03-2017 13:22:51

1.1K Tweets

497 Followers

986 Following

Jeff Dean (@jeffdean)'s Twitter Profile Photo

Demis Hassabis, James Manyika, and I wrote up a (lengthy and illustrated!) overview of the AI research work and advances across Google in 2024. It's a summary of the work of many across Google, covering Gemini advances, Gemma, NotebookLM, generative image and video models like…

Erfan Miahi (@erfan_mhi)'s Twitter Profile Photo

I don't understand why people say RLHF is a contextual bandit problem. Sure, the exploration is limited and the RL problem is badly formulated. But you still have to solve the temporal credit assignment problem (updating all tokens), which is not part of the contextual-bandit setting.
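
A toy PyTorch sketch of that point (shapes and numbers are made up for illustration): a single outcome reward at the end of a response still assigns credit to every generated token, which a one-shot bandit update has no notion of.

```python
# Toy sketch (hypothetical shapes): one scalar reward for the whole response
# still pushes gradient into EVERY token's log-prob -- credit assignment
# across the sequence, not a single bandit arm.
import torch

vocab, seq_len = 50, 8
logits = torch.randn(seq_len, vocab, requires_grad=True)  # stand-in for the policy's outputs
tokens = torch.randint(0, vocab, (seq_len,))              # the sampled response

# log-probability of each sampled token under the policy
log_probs = torch.log_softmax(logits, dim=-1)[torch.arange(seq_len), tokens]

reward = 1.0                          # one outcome reward for the full response
loss = -reward * log_probs.sum()      # REINFORCE-style surrogate objective
loss.backward()

print(logits.grad.abs().sum(dim=-1))  # nonzero gradient at all 8 token positions
```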

Erfan Miahi (@erfan_mhi)'s Twitter Profile Photo

Reinforcement learning shook world politics once before with AlphaGo, especially in China, and now, ~9 years later, DeepSeek R1 is doing it again on a much larger scale. RL IS THE FUTURE, as I always believed. It's the ultimate chess move in the game of intelligence. #DeepSeek #AI #rl

Aran Komatsuzaki (@arankomatsuzaki)'s Twitter Profile Photo

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Shows that:
- RL generalizes in rule-based envs, esp. when trained with an outcome-based reward
- SFT tends to memorize the training data and struggles to generalize OOD
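
A minimal sketch of the two training signals being contrasted (the function names and toy env are hypothetical, just to make the distinction concrete): SFT scores token-by-token agreement with a reference string, while the outcome-based RL reward only checks whether the final answer satisfies the environment's rule.

```python
# Hypothetical toy contrast between the two post-training signals.

def sft_loss(token_log_probs, reference_tokens):
    # SFT: cross-entropy against the demonstration tokens -- the model is
    # rewarded for reproducing the training string itself
    return -sum(lp[t] for lp, t in zip(token_log_probs, reference_tokens))

def outcome_reward(generated_answer, rule):
    # RL in a rule-based env: reward depends only on whether the final
    # answer passes the rule, not on matching any reference text
    return 1.0 if rule(generated_answer) else 0.0

# e.g. a rule-based arithmetic env: any phrasing that yields 42 gets reward 1
print(outcome_reward("42", lambda ans: int(ans) == 6 * 7))  # -> 1.0
```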

Qwen (@alibaba_qwen)'s Twitter Profile Photo

Today, we release QwQ-32B, our new reasoning model with only 32 billion parameters that rivals cutting-edge reasoning models, e.g., DeepSeek-R1.

Blog: qwenlm.github.io/blog/qwq-32b
HF: huggingface.co/Qwen/QwQ-32B
ModelScope: modelscope.cn/models/Qwen/Qw…
Demo: huggingface.co/spaces/Qwen/Qw…
Qwen Chat:
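
For reference, a minimal way to try the checkpoint via the HF link above, using the standard transformers causal-LM API (a generic loading sketch, not the official recipe; it assumes hardware that can hold 32B weights, and the prompt and generation settings are illustrative).

```python
# Minimal sketch: load the released checkpoint with the standard
# transformers causal-LM API and run one chat-formatted generation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B"  # the HF repo linked above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```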

Richard Sutton (@richardssutton)'s Twitter Profile Photo

David Silver really hits it out of the park in this podcast. The paper "Welcome to the Era of Experience" is here: goo.gle/3EiRKIH.

λux (@novasarc01)'s Twitter Profile Photo

the mit 6.S184 lectures on flow matching and diffusion are really helpful for anyone who wants to get started with flow matching and the in-depth intuition behind it
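
For a taste of what those lectures build up to, here is a minimal PyTorch sketch of the linear-path conditional flow matching objective (the tiny network and shapes are made up): sample a time t, interpolate between noise and data, and regress a velocity network onto the straight path's constant velocity x1 - x0.

```python
# Minimal sketch of the linear-path conditional flow matching loss.
import torch

def cfm_loss(velocity_net, x1):
    x0 = torch.randn_like(x1)       # noise sample
    t = torch.rand(x1.shape[0], 1)  # per-example time in [0, 1]
    xt = (1 - t) * x0 + t * x1      # point on the straight noise->data path
    target = x1 - x0                # constant velocity of that path
    return ((velocity_net(xt, t) - target) ** 2).mean()

# usage with a hypothetical tiny velocity net over 2-d data:
net = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
loss = cfm_loss(lambda x, t: net(torch.cat([x, t], dim=-1)), torch.randn(16, 2))
print(loss)
```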

Erfan Miahi (@erfan_mhi)'s Twitter Profile Photo

It’s crazy how the demand for training coding models with RL has exploded in just a few months. People from finance to IT are literally throwing 💰 at me! Everybody wants their own specialized coding model now. Wild.