Bartłomiej Cupiał (@cupiabart)'s Twitter Profile
Bartłomiej Cupiał

@cupiabart

I sure do like machine learning

ID: 1130582450080555008

20-05-2019 21:13:56

86 Tweets

1.1K Followers

430 Following

Tim Rocktäschel (@_rockt)

Excited to announce "BALROG: a Benchmark for Agentic LLM and VLM Reasoning On Games", led by UCL DARK's Davide Paglieri! Douwe Kiela's plot below is maybe the scariest for measuring AI progress: LLM benchmarks are saturating at an accelerating rate and unless we find new ways to

Ethan Mollick (@emollick)

This may sound odd, but game-based benchmarks are some of the most useful for AI, since we have human scores and they require reasoning, planning & vision.

The hardest of all is NetHack. No AI is close, and I suspect that an AI that can fairly win/ascend would need to be AGI-ish.

Davide Paglieri (@paglieridavide)

🚨BALROG leaderboard update

This week's new entries on balrogai.com are:

Llama 3.3 70B Instruct 🫤
Claude 3.5 Haiku✨
Mistral-Nemo-it (12B) 🆗

GitHub: github.com/balrog-ai/BALR…

Bartłomiej Cupiał (@cupiabart)

BALROG, our benchmark for agentic LLM and VLM reasoning on games, has just been accepted to #ICLR! See you in Singapore 🇸🇬!

Martin Klissarov (@martinklissarov)

Can AI agents adapt zero-shot to complex multi-step language instructions in open-ended environments? We present MaestroMotif, a method for AI-assisted skill design that produces highly capable and steerable hierarchical agents. To the best of our knowledge, it is the first

Aviral Kumar (@aviral_kumar2)

🚨Current scalable RL algos train a policy w/o a value func, which is limiting for learning in open-ended, non-stationary, dynamic environments. But how to scale value-based RL with more data/compute is unclear... Not anymore: presenting scaling laws for value-based RL

Bartłomiej Cupiał (@cupiabart)

Fascinating work from my colleagues on MoE scaling laws! 🔥 They showed you can actually get better performance with MoEs under the same memory constraints as dense models. Really cool to see how they challenged the common assumption about memory vs compute trade-offs.

Davide Paglieri (@paglieridavide)

A new challenger has entered the ring 🥉

This week’s entry on balrogai.com takes third place, powered by a 21B reasoning model.

Reka Flash 3 from Reka (@RekaAILabs) dominates similarly sized reasoning models like DeepSeek-R1-Distill-Qwen 32B on BALROG’s toughest agentic tasks!
🧵

Davide Paglieri (@paglieridavide)

Excited to be in Singapore for ICLR 2025! 🇸🇬 

📷 We will present BALROG at the poster session on Saturday, 3:00-5:30 PM, Hall 3, #252.

Sneak peek at the poster, including the updated leaderboard with some new models, more on them soon 👀

Bartłomiej Cupiał, Ulyana Piterbarg, Tim Rocktäschel

Bartłomiej Cupiał (@cupiabart)

My friend and PhD supervisor, Łukasz Kuciński, is currently battling brain cancer. Hoping for his full recovery. Please consider supporting his fight: siepomaga.pl/lukasz-kucinski