Lucas Nestler (@_clashluke)'s Twitter Profile
Lucas Nestler

@_clashluke

Researcher at keenagi.com, TensorFork member, building @_algolens

ID: 1312500980823601152

Link: https://nestler.sh/
Joined: 03-10-2020 21:14:29

1.1K Tweets

4.4K Followers

282 Following

Claude 3.7 Sonnet can play Pokémon Red? Interesting. But *playing* isn't *winning*.

Anthropic, let's quantify "play."
I propose a rigorous benchmark: Anthropic's Claude vs. nunu.ai's agent. Head-to-head, *Pokémon Red/Blue* (original, no save states).

One month. Then, a

Don't underestimate this change!

Simply swapping LayerNorm with DyT (tanh-based) maintains AdamW convergence levels.

Why is this big news?
Second-order optimizers perform best on normalization-free architectures - which is precisely what DyT enables

x.com/liuzhuang1234/…
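
For context, here is a minimal sketch of what the swap looks like, assuming the usual DyT form `gamma * tanh(alpha * x) + beta` applied element-wise. The function names, the default `alpha = 0.5`, and the plain-Python LayerNorm without affine parameters are illustrative assumptions, not taken from the tweet. The point it shows: LayerNorm needs the mean and variance of the whole vector, while DyT touches each element independently.

```python
import math


def dyt(x, alpha=0.5, gamma=None, beta=None):
    """Dynamic Tanh on one feature vector: gamma * tanh(alpha * x) + beta.

    alpha is a scalar; gamma and beta are per-feature scale and shift.
    The defaults (alpha=0.5, gamma=1, beta=0) are assumptions for illustration.
    """
    gamma = gamma if gamma is not None else [1.0] * len(x)
    beta = beta if beta is not None else [0.0] * len(x)
    return [g * math.tanh(alpha * v) + b for v, g, b in zip(x, gamma, beta)]


def layer_norm(x, eps=1e-5):
    """Reference LayerNorm without affine parameters, for comparison only."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


if __name__ == "__main__":
    sample = [-4.0, -0.5, 0.0, 0.5, 4.0]
    # DyT squashes each element on its own; LayerNorm depends on vector statistics.
    print("DyT:      ", [round(v, 3) for v in dyt(sample)])
    print("LayerNorm:", [round(v, 3) for v in layer_norm(sample)])
```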

Not all models excel at everything - specialized strengths matter.

New AlgoLens benchmark data:
• For creative writing: Mistral-7B delivers the best quality-to-cost ratio
• For scientific texts: Phi-4 and Qwen-Turbo

The data speaks for itself:
x.com/TheXeophon/sta…

MCBench may be the first benchmark to test the emergent ability of internal vision-language alignment in the LM's world model.
 
Under that lens, it makes sense that Claude and Gemini are currently #1, ahead of even Qwen-Max and QwQ:

While Qwen has strong visual capacities, such

"H200 performance [measured on H100 node]"
"1.67x speedup of B200 vs H200* [after going from fp8 to fp4]"
*"H100"

x.com/NVIDIAAIDev/st…

Don't blame SOAP when HeavyBall was broken.
v1.6.3 fixes our convergence regression that's been killing your training runs.

Second-order optimization should be your default - now it can be

HeavyBall 2 pre-release is out

Accelerate your einsums, SOAP and PSGD with
`heavyball.utils.set_torch(einsum_strategy='heavyball')`

Our "Physical Atari" demonstrates: simulation-trained agents fail in the real world

Instead, our robot learns in real-time, using its own real-world experience, outperforming the state of the art

Check out the annotated image and John's slides for more
x.com/ID_AA_Carmack/…