Simone Scardapane (@s_scardapane)'s Twitter Profile
Simone Scardapane

@s_scardapane

I fall in love with a new #machinelearning topic every month 🙄 |
Researcher @SapienzaRoma | Author: Alice in a diff wonderland sscardapane.it/alice-book

ID: 1235205731747540993

Link: https://www.sscardapane.it/ · Joined: 04-03-2020 14:09:51

1.1K Tweets

11.11K Followers

659 Following


*Memory Layers at Scale*
by <a href="/vinceberges/">Vincent-Pierre Berges</a> <a href="/barlas_berkeley/">Barlas Oğuz</a> <a href="/d_haziza/">Daniel Haziza</a> <a href="/LukeZettlemoyer/">Luke Zettlemoyer</a> <a href="/gargighosh/">Gargi Ghosh</a>

They show how to scale memory layers - simple variants of attention with potentially unbounded parameter count, which can be viewed as associative memories.

arxiv.org/abs/2412.09764
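The lookup can be sketched in a few lines of numpy (slot count, dims, and top-k here are made up; the paper uses product keys to make the top-k search tractable at scale):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 1024 memory slots, 16-dim embeddings, top-4 lookup.
num_slots, dim, topk = 1024, 16, 4
keys = rng.standard_normal((num_slots, dim))    # learnable keys
values = rng.standard_normal((num_slots, dim))  # learnable values

def memory_layer(query):
    """Sparse associative lookup: score the query against all keys,
    keep only the top-k matches, softmax them, mix their values."""
    scores = keys @ query                        # (num_slots,)
    top = np.argpartition(scores, -topk)[-topk:]
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()
    return w @ values[top]                       # (dim,)

out = memory_layer(rng.standard_normal(dim))
```

Because only top-k slots are touched per query, the parameter count can grow without a matching growth in FLOPs.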

*The GAN is dead; long live the GAN!*
by <a href="/SkyLi0n/">Aaron Gokaslan</a> <a href="/zonkedNGray/">zonkedNGray</a>  <a href="/volokuleshov/">Volodymyr Kuleshov 🇺🇦</a> <a href="/jtompkin/">James Tompkin</a> 

A modern GAN baseline exploiting a "relativistic" loss (which avoids a minimax problem), gradient penalties, and newer backbones.

openreview.net/forum?id=OrtN9…
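A hedged sketch of the relativistic pairing, with softplus as the f in f(D(fake) − D(real)); the scores below are placeholders, not from the paper's setup:

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + e^x).
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

def relativistic_d_loss(real_scores, fake_scores):
    """Discriminator loss on paired real/fake critic scores: each real
    sample only needs to out-score its paired fake, which replaces the
    usual minimax game with a single objective."""
    return softplus(fake_scores - real_scores).mean()

def relativistic_g_loss(real_scores, fake_scores):
    # The generator gets the mirrored objective.
    return softplus(real_scores - fake_scores).mean()

d = relativistic_d_loss(np.array([2.0, 1.0]), np.array([-1.0, 0.0]))
g = relativistic_g_loss(np.array([2.0, 1.0]), np.array([-1.0, 0.0]))
```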

Happy to share I just started as associate professor in <a href="/SapienzaRoma/">Sapienza Università di Roma</a>! I have now reached my perfect thermodynamical equilibrium. 😄

Also, ChatGPT's idea of me is infinitely cooler, so I'll leave it here to trick people into giving me money.

*Test-time regression: a unifying framework for designing sequence models with associative memory*
by <a href="/heyyalexwang/">Alex Wang</a> <a href="/thjashin/">Jiaxin Shi</a>

Cool paper showing that many sequence layers can be framed as solving an associative recall task during the forward pass.

arxiv.org/abs/2501.12352
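A toy numpy sketch of the framing — key/value recall posed as a regression problem solved at test time, with linear attention as the crude one-step estimate (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 4                       # 4 key-value pairs in 8 dims
keys = rng.standard_normal((n, d))
vals = rng.standard_normal((n, d))

# Associative recall as regression: find W with W @ k_i ≈ v_i.
# The min-norm least-squares solution recalls exactly while n <= d.
X, *_ = np.linalg.lstsq(keys, vals, rcond=None)  # solves keys @ X = vals
W = X.T

# Linear attention corresponds to the cruder one-shot estimate
# W ≈ sum_i v_i k_i^T (a single gradient step from zero, per the framing).
W_linattn = vals.T @ keys

recalled = W @ keys[0]            # ≈ vals[0]
```

Different sequence layers then differ mainly in which regression estimator they implicitly run.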

*Interpretability in Parameter Space*
by <a href="/danbraunai/">Dan Braun</a> <a href="/leedsharkey/">Lee Sharkey</a> &amp; al.

They look for "interpretable" components directly in the weight space of a model, by optimizing for several desiderata (faithfulness, low rank, etc.). For now it only works on toy tasks.

arxiv.org/abs/2501.14926
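To make one desideratum concrete, here is a hedged sketch using SVD for the low-rank split; the paper learns its components by optimization rather than SVD, so this only illustrates the faithfulness constraint (components must sum back to the weights):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 6))   # a toy weight matrix

# Faithful low-rank decomposition: rank-1 components whose sum is W.
U, s, Vt = np.linalg.svd(W)
components = [s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s))]

W_recon = sum(components)         # faithfulness check: equals W
```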

*Scalable-Softmax Is Superior for Attention*
by Ken Nakanishi

Scaling the softmax inputs by the sequence length may improve generalization to longer sequences. No idea if it's novel, but it's so cool to see a single-author paper on new components - reminds me of 2014!

arxiv.org/abs/2501.19399
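The proposed change is tiny; a sketch (the scaling parameter s is learnable in the paper, fixed here):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ssmax(z, s=1.0):
    """Scalable-Softmax: scale the logits by s * log(n) before the
    softmax, so the distribution does not flatten as n grows."""
    n = len(z)
    return softmax(s * np.log(n) * z)

z = np.array([3.0, 1.0, 0.0, 0.0])
p_std = softmax(z)
p_ss = ssmax(z)   # sharper than p_std once log(n) > 1
```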

*Language Models Use Trigonometry to Do Addition*
by <a href="/thesubhashk/">Subhash Kantamneni</a> <a href="/tegmark/">Max Tegmark</a>

Really interesting paper that retro-engineers several LLMs to understand how single-token addition is performed in the internal embeddings.

arxiv.org/abs/2502.00873
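A toy version of the "clock"-style mechanism the paper describes — numbers as angles, addition as rotation (the modulus and single frequency here are made up; the LLMs studied use a multi-frequency helix):

```python
import numpy as np

m = 113  # hypothetical modulus, just for the toy

def embed(a):
    # A number as a point on a circle (one frequency of the "helix").
    t = 2 * np.pi * a / m
    return np.array([np.cos(t), np.sin(t)])

def add_by_rotation(a, b):
    """'Clock' addition: rotating a's angle by b's angle lands exactly
    on the embedding of (a + b) mod m; decode by nearest embedding."""
    ca, sa = embed(a)
    cb, sb = embed(b)
    rotated = np.array([ca * cb - sa * sb, sa * cb + ca * sb])
    sims = [rotated @ embed(c) for c in range(m)]
    return int(np.argmax(sims))

result = add_by_rotation(27, 46)  # → 73
```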

*Universal Sparse Autoencoders*
by <a href="/HThasarathan/">Harry Thasarathan</a> <a href="/Napoolar/">Thomas Fel</a> <a href="/MatthewKowal9/">Matthew Kowal</a> <a href="/CSProfKGD/">Kosta Derpanis</a> 

They train a shared SAE latent space on several vision encoders at once, showing, e.g., how the same concept activates in different models.

arxiv.org/abs/2502.03714
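A hedged sketch of the shared-latent idea with random weights and hard top-k sparsity (the actual USAE is trained end to end; all dims here are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
d_a, d_b, d_shared = 12, 20, 64   # two hypothetical vision encoders

# Per-model encoders/decoders into one shared sparse dictionary.
enc = {"a": rng.standard_normal((d_shared, d_a)),
       "b": rng.standard_normal((d_shared, d_b))}
dec = {"a": rng.standard_normal((d_a, d_shared)),
       "b": rng.standard_normal((d_b, d_shared))}

def shared_code(x, model, k=8):
    """Encode a model-specific activation into the shared SAE space,
    keeping only the top-k latents (hard sparsity for the sketch).
    The same latent index should fire for the same concept in
    either model once trained."""
    z = np.maximum(enc[model] @ x, 0.0)     # ReLU pre-codes
    keep = np.argsort(z)[-k:]
    sparse = np.zeros_like(z)
    sparse[keep] = z[keep]
    return sparse

z_a = shared_code(rng.standard_normal(d_a), "a")
recon_a = dec["a"] @ z_a   # reconstruction back in model a's space
```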

*Recursive Inference Scaling*
by <a href="/ibomohsin/">Ibrahim Alabdulmohsin | إبراهيم العبدالمحسن</a> <a href="/XiaohuaZhai/">Xiaohua Zhai</a>

Recursively applying the first part of a model can be a strong compute-efficient baseline in many scenarios, when evaluating on a fixed compute budget.

arxiv.org/abs/2502.07503
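A minimal sketch of the recursion with toy residual blocks (the split point and counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Hypothetical tiny "model": four residual tanh blocks.
blocks = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(4)]

def block_fn(W, x):
    return x + np.tanh(W @ x)   # residual block

def recursive_forward(x, num_recursions=3, split=2):
    """Apply the first `split` blocks `num_recursions` times before
    running the rest: extra effective depth, zero extra parameters."""
    for _ in range(num_recursions):
        for W in blocks[:split]:
            x = block_fn(W, x)
    for W in blocks[split:]:
        x = block_fn(W, x)
    return x

x0 = rng.standard_normal(dim)
y = recursive_forward(x0)
```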

*Rethinking Early Stopping: Refine, Then Calibrate*
by <a href="/Eugene_Berta/">Eugène Berta</a> <a href="/LChoshen/">Leshem (Legend) Choshen 🤖🤗</a> <a href="/DHolzmueller/">David Holzmüller</a> <a href="/BachFrancis/">Francis Bach</a>

Doing early stopping on the "refinement loss" (the original loss minus its calibration component) is beneficial for both accuracy and calibration.

arxiv.org/abs/2501.19195
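A hedged numpy sketch of the decomposition, using the Brier score and bin-wise recalibration (the paper treats proper losses in general; binning is a simplification here):

```python
import numpy as np

def brier_decomposition(probs, labels, n_bins=10):
    """Split the Brier score into calibration + refinement, where
    refinement is the loss left after perfectly recalibrating each
    bin to its empirical frequency. Early-stop on refinement."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    total = np.mean((probs - labels) ** 2)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    recal = probs.copy()
    for b in np.unique(bins):
        recal[bins == b] = labels[bins == b].mean()
    refinement = np.mean((recal - labels) ** 2)
    return total - refinement, refinement

cal, ref = brier_decomposition([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

On this toy input every bin is already pure, so all remaining loss is calibration error and the refinement term is zero.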

*ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features*
by <a href="/alec_helbling/">Alec Helbling</a> <a href="/tunahansalih/">Tuna Meral</a> <a href="/Ben_Hoov/">Ben Hoover</a> <a href="/PINguAR/">Pinar Yanardag 🚀 CVPR2025</a> <a href="/PoloChau/">Duen Horng "Polo" Chau</a>

Creates saliency maps for diffusion ViTs by propagating concepts (e.g., car) and repurposing cross-attention layers.

arxiv.org/abs/2502.04320
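A rough sketch of the saliency computation — a concept embedding dotted against image tokens, softmaxed over the patch grid (dims invented; the real method builds this inside the DiT's attention layers):

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 64, 32   # hypothetical 8x8 token grid

patch_tokens = rng.standard_normal((num_patches, dim))  # image tokens
concept_token = rng.standard_normal(dim)                # e.g. "car"

def concept_saliency(tokens, concept):
    """Attention-style saliency: score each image token against the
    concept embedding, normalize into a map over the patch grid."""
    scores = tokens @ concept / np.sqrt(tokens.shape[1])
    e = np.exp(scores - scores.max())
    return (e / e.sum()).reshape(8, 8)

saliency = concept_saliency(patch_tokens, concept_token)
```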

*From superposition to sparse codes: interpretable representations in NNs*
by <a href="/klindt_david/">David Klindt</a> <a href="/ninamiolane/">Nina Miolane 🦋 @ninamiolane.bsky.social</a> <a href="/rpatrik96/">Patrik Reizinger</a> <a href="/charles0neill/">Charlie O'Neill</a> 

Nice overview on the linearity of NN representations and the use of sparse coding to recover interpretable activations.

arxiv.org/abs/2503.01824
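A self-contained sketch of sparse coding via ISTA, the classic way to recover a sparse code against a dictionary (the dictionary here is random rather than learned from activations):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_atoms = 16, 64
D = rng.standard_normal((d, n_atoms))
D /= np.linalg.norm(D, axis=0)    # unit-norm dictionary atoms

def ista(x, D, lam=0.1, n_steps=200):
    """Sparse coding by ISTA: gradient step on ||x - D z||^2 followed
    by soft-thresholding, yielding a sparse code z for signal x."""
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_steps):
        g = D.T @ (D @ z - x)
        z = z - g / L
        z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return z

# A signal built from 3 atoms should get back an (approximately)
# 3-sparse code.
true_z = np.zeros(n_atoms)
true_z[[3, 17, 42]] = [1.5, -2.0, 1.0]
x_sig = D @ true_z
z_hat = ista(x_sig, D)
```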

*Differentiable Logic Cellular Automata*
by <a href="/PietroMiotti/">Pietro Miotti</a> <a href="/eyvindn/">Eyvind Niklasson</a> <a href="/RandazzoEttore/">Ettore Randazzo</a> <a href="/zzznah/">Alex Mordvintsev</a>

Combines differentiable cellular automata and logic circuits to learn recurrent circuits exhibiting complex (learned) behavior.

google-research.github.io/self-organisin…
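A hedged sketch of the differentiable-logic ingredient: soft gates that are exact on {0,1} corners plus a softmax-weighted gate choice per wire (the cellular-automaton wiring on a grid is omitted):

```python
import numpy as np

def soft_gates(a, b):
    """Real-valued relaxations of logic gates on inputs in [0, 1]:
    exact on the {0,1} corners, differentiable in between."""
    return {
        "and": a * b,
        "or": a + b - a * b,
        "xor": a + b - 2 * a * b,
        "not_a": 1 - a,
    }

def gate_layer(a, b, logits):
    """Differentiable gate selection: a softmax over the gate set,
    with logits ordered as [and, or, xor, not_a]; training pushes
    each wire toward one hard gate."""
    g = soft_gates(a, b)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return sum(wi * gi for wi, gi in zip(w, g.values()))

out = soft_gates(1.0, 0.0)
```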

*Generalized Interpolating Discrete Diffusion*
by <a href="/dvruette/">Dimitri von Rütte</a> <a href="/orvieto_antonio/">Antonio Orvieto</a> &amp; al.

A class of discrete diffusion models combining standard masking with uniform noise to allow the model to potentially "correct" previously wrong tokens.

arxiv.org/abs/2503.04482
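The forward corruption can be sketched as follows (probabilities and vocab size are placeholders; the paper defines the exact interpolation schedule):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50
MASK = VOCAB   # one extra token id reserved for the mask

def corrupt(tokens, t, p_uniform=0.2):
    """Forward process sketch: with probability t a token is noised;
    noised tokens become MASK, except a p_uniform fraction that are
    resampled uniformly — so the reverse model must also learn to
    correct tokens that look valid but are wrong."""
    tokens = np.asarray(tokens)
    noised = rng.random(tokens.shape) < t
    out = tokens.copy()
    out[noised] = MASK
    re_sample = noised & (rng.random(tokens.shape) < p_uniform)
    out[re_sample] = rng.integers(0, VOCAB, tokens.shape)[re_sample]
    return out

x = rng.integers(0, VOCAB, 100)
x_noisy = corrupt(x, t=0.5)
```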

Twitter friends, here's some draft notes for my upcoming course on automatic differentiation, mostly based on the "Elements of differentiable programming" book. Let me know what you think! They also include a notebook on operator overloading. 🙃

notion.so/sscardapane/Au…
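For the operator-overloading part, the standard minimal example is forward-mode AD with dual numbers:

```python
class Dual:
    """Forward-mode AD by operator overloading: carry (value, deriv)
    through arithmetic, so f(Dual(x, 1)).d equals f'(x)."""

    def __init__(self, v, d=0.0):
        self.v, self.d = v, d

    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.v + o.v, self.d + o.d)
    __radd__ = __add__

    def __mul__(self, o):
        # Product rule lives here.
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.v * o.v, self.d * o.v + self.v * o.d)
    __rmul__ = __mul__

def f(x):
    return 3 * x * x + 2 * x + 1   # f'(x) = 6x + 2

y = f(Dual(2.0, 1.0))              # y.v = 17, y.d = 14
```

The point of the overloading trick is that `f` is ordinary Python: it never knows it is being differentiated.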

*NoProp: Training Neural Networks without Backpropagation or Forward-propagation*
by <a href="/yeewhye/">Yee Whye Teh</a> et al.

They use a neural network to define a denoising process over the class labels, which allows them to train the blocks independently (i.e., "no backprop").

arxiv.org/abs/2503.24322
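A very rough structural sketch with untrained linear blocks, just to show the shape of inference (everything here is invented except the block-independence idea):

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z, T = 8, 4, 3   # input dim, label-embedding dim, num blocks

# One independent "denoising" block per step; training would fit each
# block alone to map (input, noisy label embedding) -> clean label
# embedding, with no gradients flowing between blocks.
blocks = [rng.standard_normal((d_z, d_x + d_z)) * 0.1 for _ in range(T)]

def infer(x):
    """Inference: start from pure noise over the label embedding and
    let each block denoise it in turn — hence no end-to-end backprop
    is ever needed."""
    z = rng.standard_normal(d_z)
    for W in blocks:
        z = W @ np.concatenate([x, z])
    return z

z_hat = infer(rng.standard_normal(d_x))
```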