Smerity (@smerity)'s Twitter Profile
Smerity

@smerity

Focused on machine learning & society. Previously @Salesforce Research via @MetaMindIO. @Harvard '14, @Sydney_Uni '11. 🇦🇺 in SF.
Also @ smerity.bsky.social

ID: 15363432

Link: https://state.smerity.com/ · Joined: 09-07-2008 08:00:18

13K Tweets

31K Followers

2K Following

Smerity (@smerity):

François Chollet Excited to see the return of RNNs but wish their citations were better. Our QRNN paper (2016) has variants similar/identical to minGRU & minLSTM.
RWKV, S4, Mamba et al. include citations to QRNN (2016) and SRU (2017) for a richer history + better context.
arxiv.org/abs/1611.01576
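The structural overlap claimed above is easy to see in code. Below is a minimal NumPy sketch (my illustration, not either paper's implementation) of the QRNN's "f-pooling" recurrence; the key shared property is that the gates depend only on the input, never on the previous hidden state, which is also what makes minGRU/minLSTM parallelizable.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def qrnn_f_pooling(Z, F, h0):
    # QRNN "f-pooling": h_t = f_t * h_{t-1} + (1 - f_t) * z_t.
    # Z (candidates) and F (forget gates) are precomputed from the input
    # alone, with no dependence on h_{t-1}.
    h = h0
    out = []
    for z_t, f_t in zip(Z, F):
        h = f_t * h + (1.0 - f_t) * z_t
        out.append(h)
    return np.stack(out)

# minGRU's update h_t = (1 - z_t) * h_{t-1} + z_t * htilde_t has the same
# form: substitute f_t = 1 - z_t and the two recurrences coincide.
rng = np.random.default_rng(0)
T, d = 5, 3
Z = rng.normal(size=(T, d))           # candidate values (from input only)
F = sigmoid(rng.normal(size=(T, d)))  # forget gates in (0, 1)
h0 = np.zeros(d)
H = qrnn_f_pooling(Z, F, h0)
assert H.shape == (T, d)
```

Because nothing in the loop body except `h` depends on earlier steps, the whole recurrence can be evaluated with a parallel scan, which is the trick the S4/Mamba line of work also exploits.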
Hugh Riminton (@hughriminton):

Feel like a little outrage - how about this: pokies and gambling companies are claiming more tax breaks for R&D than some of the country’s biggest tech companies. 🤷‍♂️ Financial Review afr.com/rear-window/at…

Smerity (@smerity):

The more you use it, the more hot paths you see everywhere in Python's ecosystem - well-worn trails connecting optimized nodes, paved over time by countless developers. I argue that's Python's implicit JIT ecosystem at work. state.smerity.com/smerity/state/…

Rudy Gilman (@rgilman33):

Introducing darkspark, a gui for your neural network. It traces your pytorch code and brings up a visual representation for you to interact with. We have a hosted gallery of popular model architectures pre-traced and ready to explore. Here’s stable-diffusion-v1.5

Smerity (@smerity):

My contribution to a discussion on explorables/user interfaces for controlling ML tools > We're trying to rig soundboards to control LLMs thinking there's a well defined interface underneath when it's actually a button that drops fertilizer into the river of a complex ecosystem.

Rahel Jhirad (@raheljhirad):

Thank you Ian Johnson 🔬🤖 for organizing this … amazing unconference

So friendly and so diverse a group of talented folks

Leland McInnes Linus awesome keynotes and huge shout out to Leland McInnes on cool resources

Great to meet .txt Chroma
And see Smerity
Rudy Gilman (@rgilman33):

I came across a strange creature yesterday while tracing a circuit in DINOv2: the upside-down GELU.

I’d always thought of GELU as just a smoother ReLU that died less and was easier to optimize. I thought I could ignore the tiny dip into negative territory in the same way I
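The "tiny dip into negative territory" comes straight from GELU's definition, gelu(x) = x·Φ(x) with Φ the standard normal CDF. A quick check in plain Python:

```python
import math

def gelu(x):
    # Exact GELU: gelu(x) = x * Phi(x), with Phi the standard normal CDF,
    # written via erf: Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Unlike ReLU, GELU goes slightly negative for negative inputs, reaching
# its minimum of about -0.17 near x = -0.75, then decaying back toward 0.
print(gelu(-1.0))   # ≈ -0.159
print(gelu(-0.75))  # ≈ -0.170
print(gelu(2.0))    # ≈ 1.954
```

That small negative lobe is exactly the region an "upside-down GELU" circuit could exploit, rather than being safely ignorable.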
Pieter Abbeel (@pabbeel):

Founders who were PhD or post-doc in my lab at Berkeley, **largely funded by NSF / DoD grants**, start-up, market cap (collected by OpenAI Deep Research)

François Chollet (@fchollet):

Today, we're releasing ARC-AGI-2. It's an AI benchmark designed to measure general fluid intelligence, not memorized skills – a set of never-seen-before tasks that humans find easy, but current AI struggles with.

It keeps the same format as ARC-AGI-1, while significantly
Benjamin Spiegel (@superspeeg):

Why did only humans invent graphical systems like writing? 🧠✍️ In our new paper at CogSci Society, we explore how agents learn to communicate using a model of pictographic signification similar to human proto-writing. 🧵👇

Caglar Gulcehre (@caglarml):

📢I am thrilled to announce this paper. We showed that it is possible to significantly improve the FunSearch method with RL and achieve impressive algorithmic discoveries on challenging NP-complete combinatorial optimization tasks like TSP and bin-packing.

Rudy Gilman (@rgilman33):

The VAE used in SDXL has extremely high-magnitude "splotches" in its latents. The individual neurons in these blobs fire with magnitudes of close to a million. These aren't some accident of training or initialization—the model creates these high-magnitude splotches for a

Lucky Iyinbor (@luckyballa):

So Flow Matching is *just*

xt = mix(x0, x1, t)
loss = mse((x1 - x0) - nn(xt, t))

Nice, here it is in a fragment shader :) shadertoy.com/view/tfdXRM
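The two-line pseudocode above expands into runnable NumPy as follows; `mix` and `nn` are the tweet's placeholder names, with `nn` stubbed as an oracle here since no trained network is available.

```python
import numpy as np

rng = np.random.default_rng(0)

def mix(x0, x1, t):
    # Straight-line interpolation between noise x0 and data x1 at time t.
    return (1.0 - t) * x0 + t * x1

def flow_matching_loss(nn, x0, x1, t):
    # The network is regressed onto the constant velocity (x1 - x0) of the
    # straight path, evaluated at the interpolated point xt and time t.
    xt = mix(x0, x1, t)
    return np.mean(((x1 - x0) - nn(xt, t)) ** 2)

x0 = rng.normal(size=(8, 2))   # noise batch
x1 = rng.normal(size=(8, 2))   # data batch
t = rng.uniform(size=(8, 1))   # per-sample times in [0, 1]

# An oracle that returns the true velocity drives the loss to exactly 0;
# a trained network approximates this from (xt, t) alone.
oracle = lambda xt, t: x1 - x0
assert flow_matching_loss(oracle, x0, x1, t) == 0.0
```

At sampling time, one integrates dx/dt = nn(x, t) from t = 0 to t = 1 starting from noise, which is the whole generative procedure.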

Aaron Defazio (@aaron_defazio):

Why do gradients increase near the end of training? 
Read the paper to find out!
We also propose a simple fix to AdamW that keeps gradient norms better behaved throughout training.
arxiv.org/abs/2506.02285
Han Guo (@hanguo97):

We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between?

Introducing Log-Linear Attention with:

- Log-linear time training
- Log-time inference (in both time and memory)
- Hardware-efficient Triton kernels
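As background for the "linear-time variants" the thread places at one end of the spectrum, here is a minimal NumPy sketch (my illustration) of unnormalized causal linear attention: each step updates a fixed-size outer-product state, so cost per token is constant in sequence length, like an RNN/SSM. Log-linear attention itself is defined in the paper and not reproduced here.

```python
import numpy as np

def linear_attention(Q, K, V):
    # Maintains a running state S_t = sum_{i<=t} k_i v_i^T, so each step
    # costs O(d^2) regardless of how long the sequence is.
    T, d = Q.shape
    S = np.zeros((d, V.shape[1]))
    out = np.zeros_like(V)
    for t in range(T):
        S = S + np.outer(K[t], V[t])   # recurrent state update
        out[t] = Q[t] @ S              # read-out with the query
    return out

rng = np.random.default_rng(0)
T, d = 6, 4
Q, K, V = rng.normal(size=(3, T, d))
O = linear_attention(Q, K, V)

# Equivalent parallel form: causally masked, unnormalized attention.
# This is the O(T^2) view; the recurrence above is the O(T) view.
mask = np.tril(np.ones((T, T)))
O_parallel = ((Q @ K.T) * mask) @ V
assert np.allclose(O, O_parallel)
```

The "in between" question is then what state sizes and costs interpolate between this constant-size state and full attention's unbounded one.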
Albert Gu (@_albertgu):

Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence.

Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
Awni Hannun (@awnihannun):

The new Kimi K2 1T model (4-bit quant) runs on 2 512GB M3 Ultras with mlx-lm and mx.distributed. 1 trillion params, at a speed that's actually quite usable: