Pang Wei Koh (@pangweikoh) 's Twitter Profile
Pang Wei Koh

@pangweikoh

Assistant professor at @uwcse and visiting research scientist at @allen_ai. Formerly @StanfordAILab @GoogleAI @Coursera. 🇸🇬

ID: 1273467805283659777

Link: https://koh.pw · Joined: 18-06-2020 04:09:26

323 Tweets

3.3K Followers

900 Following

Ai2 (@allen_ai) 's Twitter Profile Photo

With fresh support of $75M from U.S. National Science Foundation and $77M from @NVIDIA, we’re set to scale our open model ecosystem, bolster the infrastructure behind it, and fast‑track reproducible AI research to unlock the next wave of scientific discovery. 💡

Yi Tay (@yitayml) 's Twitter Profile Photo

Had a really wonderful time hosting Jeff Dean, Quoc Le, benoit schillings and Denny Zhou in Singapore for the Google DeepMind Gemini Singapore 🇸🇬 event last week! 🔥 The event went super well imo, the vibes were on-point and an overwhelming number of people told me directly

Valentin Hofmann (@vjhofmann) 's Twitter Profile Photo

📢 New #COLM2025 paper 📢

Standard benchmarks give every LLM the same questions. This is like testing 5th graders and college seniors with *one* exam! 🥴

Meet Fluid Benchmarking, a capability-adaptive eval method delivering lower variance, higher validity, and reduced cost.

🧵
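The tweet doesn't spell out the selection mechanism, but the core idea of a capability-adaptive eval can be sketched as an adaptive-testing loop: repeatedly ask the unanswered item whose difficulty is closest to the current ability estimate, then nudge the estimate based on correctness. Everything below (the items, the update rule, the function names) is a hypothetical simplification for illustration, not the actual Fluid Benchmarking algorithm:

```python
def run_adaptive_eval(items, answer_fn, n_questions=5):
    """Toy capability-adaptive eval: pick the unasked item whose
    difficulty is nearest the current ability estimate, then move
    the estimate up or down depending on correctness.
    (Illustrative sketch only -- not Fluid Benchmarking itself.)"""
    ability = 0.0   # running ability estimate (logit-like scale)
    step = 1.0      # update size, damped as the estimate stabilizes
    asked = []
    for _ in range(n_questions):
        remaining = [it for it in items if it not in asked]
        if not remaining:
            break
        # most informative item ~ difficulty closest to current ability
        item = min(remaining, key=lambda it: abs(it["difficulty"] - ability))
        asked.append(item)
        ability += step if answer_fn(item) else -step
        step *= 0.7
    return ability

items = [{"id": i, "difficulty": d} for i, d in enumerate([-2, -1, 0, 1, 2])]
# a toy "model" that answers correctly iff the item is easier than 0.5
ability = run_adaptive_eval(items, lambda it: it["difficulty"] < 0.5)
```

Because each question targets the current estimate, a weak model never burns budget on questions it will certainly fail, and a strong model skips the trivial ones, which is where the variance and cost reductions come from.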
Ai2 (@allen_ai) 's Twitter Profile Photo

Introducing OlmoEarth 🌍, state-of-the-art AI foundation models paired with ready-to-use open infrastructure to turn Earth data into clear, up-to-date insights within hours—not years.

Zhiyuan Zeng (@zhiyuanzeng_) 's Twitter Profile Photo

RL is bounded by finite data😣?
Introducing RLVE: RL with Adaptive Verifiable Environments

We scale RL with data procedurally generated from 400 envs dynamically adapting to the trained model

💡find supervision signals right at the LM capability frontier + scale them

🔗in🧵
Pang Wei Koh (@pangweikoh) 's Twitter Profile Photo

Two ideas here for scaling up RL for reasoning: 1. Procedurally generating (verifiable) problems lets us adapt difficulty to the model, making training more efficient 2. Teaching the model to reason by hand (e.g., sort numbers w/o code) generalizes to realistic reasoning tasks!
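Idea 1 can be illustrated with a minimal sketch (hypothetical names; not the actual RLVE environments): a problem generator with a difficulty knob paired with a programmatic verifier, so the reward signal never depends on a fixed dataset and the curriculum can track the model's capability frontier:

```python
import random

def make_sorting_problem(difficulty, seed=None):
    """Procedurally generate a verifiable sorting task; `difficulty`
    controls the list length. (Illustrative sketch, not RLVE itself.)"""
    rng = random.Random(seed)
    nums = [rng.randint(0, 99) for _ in range(3 + difficulty)]
    prompt = f"Sort these numbers in ascending order: {nums}"
    return prompt, nums

def verify(nums, answer):
    """Reward is 1.0 iff the answer is exactly the sorted input."""
    return 1.0 if answer == sorted(nums) else 0.0

prompt, nums = make_sorting_problem(difficulty=2, seed=0)
reward = verify(nums, sorted(nums))  # a perfect "policy" earns 1.0
```

If the model starts solving `difficulty=2` reliably, the generator can simply emit `difficulty=3` problems, keeping supervision right at the frontier.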

Rulin Shao (@rulinshao) 's Twitter Profile Photo

Ancient Chinese wisdom from Confucius says 因材施教: adjust your way of teaching according to the student's abilities. Check out Zhiyuan Zeng's amazing work on applying this wisdom to RL!

Tong Chen @ ICLR (@tomchen0) 's Twitter Profile Photo

OpenAI's blog (openai.com/index/why-lang…) points out that today’s language models hallucinate because training and evaluation reward guessing instead of admitting uncertainty. This raises a natural question: can we reduce hallucination without hurting utility?🤔

On-policy RL with
Rulin Shao (@rulinshao) 's Twitter Profile Photo

🔥Thrilled to introduce DR Tulu-8B, an open long-form Deep Research model that matches OpenAI DR 💪Yes, just 8B! 🚀

The secret? We present Reinforcement Learning with Evolving Rubrics (RLER) for long-form non-verifiable DR tasks! Our rubrics:
- co-evolve with the policy model
-
Pang Wei Koh (@pangweikoh) 's Twitter Profile Photo

We trained an open deep research model!🔍 The hard part is training signal -- deep research tasks are long-form with so many dimensions to what makes a good answer. We solve this thru RL with question-specific rubrics that co-evolve with the policy model. Check it out below!
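The RLER details are in the paper, but the basic shape of a rubric-based reward for long-form answers can be sketched as follows. The rubric items, weights, and the keyword-based `judge` below are hypothetical stand-ins (in practice the judge would be an LLM, and the rubrics would evolve with the policy):

```python
def rubric_reward(report, rubric, judge):
    """Score a long-form answer against weighted rubric criteria.
    `judge(report, criterion)` returns True/False.
    (Hypothetical sketch, not the actual RLER code.)"""
    total = sum(w for _, w in rubric)
    earned = sum(w for crit, w in rubric if judge(report, crit))
    return earned / total if total else 0.0

rubric = [
    ("cites at least one primary source", 2.0),
    ("states the main finding explicitly", 1.0),
]
# toy judge: keyword check standing in for an LLM judge
judge = lambda report, crit: (
    "source" in report if "source" in crit else "finding" in report
)
score = rubric_reward("The finding is X, per a primary source.", rubric, judge)
```

A scalar in [0, 1] per question is enough to plug into a standard policy-gradient loop, which is what makes non-verifiable long-form tasks trainable at all.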

Pang Wei Koh (@pangweikoh) 's Twitter Profile Photo

Unexpected benefit: We've been building tools to help doctors determine treatments for rare genetic diseases, which involves lots of searching -- a natural deep research task! Surprisingly, our 8B model generalizes well and can even match/outperform OpenAI DR on this OOD eval.

John Hewitt (@johnhewtt) 's Twitter Profile Photo

Come do a PhD with me at Columbia! My lab tackles basic problems in alignment, interpretability, safety, and capabilities of language systems. If you love adventuring in model internals and behaviors---to understand and improve---let's do it together! pic: a run in central park

Come do a PhD with me at Columbia!

My lab tackles basic problems in alignment, interpretability, safety, and capabilities of language systems. If you love adventuring in model internals and behaviors---to understand and improve---let's do it together!

pic: a run in central park
Ai2 (@allen_ai) 's Twitter Profile Photo

Announcing Olmo 3, a leading fully open LM suite built for reasoning, chat, & tool use, and an open model flow—not just the final weights, but the entire training journey. Best fully open 32B reasoning model & best 32B base model. 🧵

Hanna Hajishirzi (@hannahajishirzi) 's Twitter Profile Photo

Introducing Olmo 3 and our entire model flow to build Olmo 3-Think and Olmo3-Instruct. Strong results, big improvements. Massive shoutout to the team who made it happen. Lots of exciting new things come with this release:

Scott Geng (@scottgeng00) 's Twitter Profile Photo

Super excited to release Olmo 3 🦕🐄!

Wild to see my Delta Learning research go all the way from theory-land to becoming a core piece of the world’s best fully open model.

It's a good day to be a researcher 🥳
Serina Chang (@serinachang5) 's Twitter Profile Photo

📢 Come work with me at UC Berkeley Berkeley AI Research! I’m recruiting PhD students in UC Berkeley EECS and UC Joint Computational Precision Health Program. I work on AI for social good, simulating humans with AI, human-AI interaction, and applications in public health & social science. serinachang5.github.io

Michael Noukhovitch, gonna be @ICLR 2025 (@mnoukhov) 's Twitter Profile Photo

Because Olmo 3 is fully open, we decontaminate our evals from our pretraining and midtraining data. Stella Li proves this with spurious rewards: RL trained on a random reward signal can't improve on the evals, unlike some previous setups

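A common way to decontaminate evals is n-gram overlap filtering between training documents and eval items. The sketch below is a toy version of that idea (not Ai2's exact pipeline; the n-gram size and tokenization are assumptions):

```python
def ngrams(text, n=8):
    """Set of word n-grams in a lowercased, whitespace-tokenized text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc, eval_item, n=8):
    """Flag a training document if it shares any n-gram with an eval item.
    (Toy n-gram decontamination, not the actual Olmo pipeline.)"""
    return bool(ngrams(train_doc, n) & ngrams(eval_item, n))

eval_q = "what is the capital of france and why is it historically important"
doc_bad = "trivia: what is the capital of france and why is it historically important?"
doc_ok = "paris has been the capital of france since the middle ages"
```

With contaminated documents filtered out this way, a model truly cannot have memorized the eval answers, which is what makes the spurious-reward check meaningful: random-reward RL has nothing latent to surface.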
Luca Soldaini ✈️ ICLR 25 (@soldni) 's Twitter Profile Photo

Thread of appreciation for a few of the students and interns that made Olmo 3 special (just the ones i was fortunate to work with! all Ai2 interns are great!!) 🧵