Bartłomiej Cupiał (@cupiabart) Twitter Tweets • TwiCopy

Tim Rocktäschel

a year ago

Excited to announce "BALROG: a Benchmark for Agentic LLM and VLM Reasoning On Games", led by UCL DARK's Davide Paglieri! Douwe Kiela's plot below is maybe the scariest for measuring AI progress: LLM benchmarks are saturating at an accelerating rate and unless we find new ways to

Ethan Mollick

@emollick

a year ago

This may sound odd, but game-based benchmarks are some of the most useful for AI, since we have human scores and they require reasoning, planning & vision. The hardest of all is NetHack. No AI is close, and I suspect that an AI that can fairly win/ascend would need to be AGI-ish.

Davide Paglieri

@paglieridavide

a year ago

The ultimate AGI test?

Davide Paglieri

@paglieridavide

a year ago

🚨 BALROG leaderboard update
This week's new entries on balrogai.com are:
Llama 3.3 70B Instruct 🫤
Claude 3.5 Haiku ✨
Mistral-Nemo-it (12B) 🆗
GitHub: github.com/balrog-ai/BALR…

Bartłomiej Cupiał

@cupiabart

10 months ago

BALROG, our benchmark for agentic LLM and VLM reasoning on games, has just been accepted to #ICLR! See you in Singapore 🇸🇬!

Martin Klissarov

@martinklissarov

10 months ago

Can AI agents adapt zero-shot to complex multi-step language instructions in open-ended environments? We present MaestroMotif, a method for AI-assisted skill design that produces highly capable and steerable hierarchical agents. To the best of our knowledge, it is the first

Aviral Kumar

@aviral_kumar2

9 months ago

🚨 Current scalable RL algos train a policy w/o a value func, which is limiting when learning in open-ended, non-stationary, dynamic environments. But how to scale value-based RL with more data/compute is unclear... Not anymore: presenting scaling laws for value-based RL

Bartłomiej Cupiał

@cupiabart

9 months ago

Fascinating work from my colleagues on MoE scaling laws! 🔥 They showed you can actually get better performance with MoEs under the same memory constraints as dense models. Really cool to see how they challenged the common assumption about memory vs compute trade-offs.

Davide Paglieri

@paglieridavide

8 months ago

A new challenger has entered the ring 🥉
This week’s entry on balrogai.com takes third place, powered by a 21B reasoning model.
Reka Flash 3 from Reka (@RekaAILabs) dominates similarly sized reasoning models like DeepSeek-R1-Distill-Qwen 32B on BALROG’s toughest agentic tasks! 🧵

Bartłomiej Cupiał

@cupiabart

8 months ago

First time being acknowledged in an OpenAI paper 👀

Davide Paglieri

@paglieridavide

7 months ago

Excited to be in Singapore for ICLR 2025! 🇸🇬
📷 We will present BALROG at the poster session on Saturday, 3:00-5:30 PM, Hall 3, #252.
Sneak peek at the poster, including the updated leaderboard with some new models; more on them soon 👀
Bartłomiej Cupiał, Ulyana Piterbarg, Tim Rocktäschel

Bartłomiej Cupiał

@cupiabart

7 months ago

My friend and PhD supervisor, Łukasz Kuciński, is currently battling brain cancer. Hoping for his full recovery. Please consider supporting his fight: siepomaga.pl/lukasz-kucinski
