Simone Scardapane (@s_scardapane)'s Twitter Profile
Simone Scardapane

@s_scardapane

I fall in love with a new #machinelearning topic every month 🙄 |
Researcher @SapienzaRoma | Author: Alice in a diff wonderland sscardapane.it/alice-book

ID: 1235205731747540993

Link: https://www.sscardapane.it/ · Joined: 04-03-2020 14:09:51

1.1K Tweets

11.11K Followers

659 Following


*Memory Layers at Scale*
by <a href="/vinceberges/">Vincent-Pierre Berges</a> <a href="/barlas_berkeley/">Barlas Oğuz</a> <a href="/d_haziza/">Daniel Haziza</a> <a href="/LukeZettlemoyer/">Luke Zettlemoyer</a> <a href="/gargighosh/">Gargi Ghosh</a>

They show how to scale memory layers - simple variants of attention with potentially unbounded parameter count, which can be viewed as associative memories.

arxiv.org/abs/2412.09764
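The lookup can be sketched in a few lines of numpy (slot count, dims, and top-k here are made up; the paper uses product keys to make the top-k search tractable at scale):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 1024 memory slots, 16-dim embeddings, top-4 lookup.
num_slots, dim, topk = 1024, 16, 4
keys = rng.standard_normal((num_slots, dim))    # learnable keys
values = rng.standard_normal((num_slots, dim))  # learnable values

def memory_layer(query):
    """Sparse associative lookup: score the query against all keys,
    keep only the top-k matches, softmax them, mix their values."""
    scores = keys @ query                        # (num_slots,)
    top = np.argpartition(scores, -topk)[-topk:]
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()
    return w @ values[top]                       # (dim,)

out = memory_layer(rng.standard_normal(dim))
```

Because only top-k slots are touched per query, the parameter count can grow without a matching growth in FLOPs.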

*The GAN is dead; long live the GAN!*
by <a href="/SkyLi0n/">Aaron Gokaslan</a> <a href="/zonkedNGray/">zonkedNGray</a>  <a href="/volokuleshov/">Volodymyr Kuleshov 🇺🇦</a> <a href="/jtompkin/">James Tompkin</a> 

A modern GAN baseline exploiting a "relativistic" loss (which avoids a minimax problem), gradient penalties, and newer backbones.

openreview.net/forum?id=OrtN9…
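A hedged sketch of the relativistic pairing, with softplus as the f in f(D(fake) − D(real)); the scores below are placeholders, not from the paper's setup:

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + e^x).
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

def relativistic_d_loss(real_scores, fake_scores):
    """Discriminator loss on paired real/fake critic scores: each real
    sample only needs to out-score its paired fake, which replaces the
    usual minimax game with a single objective."""
    return softplus(fake_scores - real_scores).mean()

def relativistic_g_loss(real_scores, fake_scores):
    # The generator gets the mirrored objective.
    return softplus(real_scores - fake_scores).mean()

d = relativistic_d_loss(np.array([2.0, 1.0]), np.array([-1.0, 0.0]))
g = relativistic_g_loss(np.array([2.0, 1.0]), np.array([-1.0, 0.0]))
```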

Happy to share I just started as associate professor in <a href="/SapienzaRoma/">Sapienza Università di Roma</a>! I have now reached my perfect thermodynamical equilibrium. 😄

Also, ChatGPT's idea of me is infinitely cooler, so I'll leave it here to trick people into giving me money.

*Test-time regression: a unifying framework for designing sequence models with associative memory*
by <a href="/heyyalexwang/">Alex Wang</a> <a href="/thjashin/">Jiaxin Shi</a>

Cool paper showing that many sequence layers can be framed as solving an associative recall task during the forward pass.

arxiv.org/abs/2501.12352
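A toy numpy sketch of the framing — key/value recall posed as a regression problem solved at test time, with linear attention as the crude one-step estimate (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 4                       # 4 key-value pairs in 8 dims
keys = rng.standard_normal((n, d))
vals = rng.standard_normal((n, d))

# Associative recall as regression: find W with W @ k_i ≈ v_i.
# The min-norm least-squares solution recalls exactly while n <= d.
X, *_ = np.linalg.lstsq(keys, vals, rcond=None)  # solves keys @ X = vals
W = X.T

# Linear attention corresponds to the cruder one-shot estimate
# W ≈ sum_i v_i k_i^T (a single gradient step from zero, per the framing).
W_linattn = vals.T @ keys

recalled = W @ keys[0]            # ≈ vals[0]
```

Different sequence layers then differ mainly in which regression estimator they implicitly run.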

*Interpretability in Parameter Space*
by <a href="/danbraunai/">Dan Braun</a> <a href="/leedsharkey/">Lee Sharkey</a> &amp; al.

They look for "interpretable" components directly in the weight space of a model, by optimizing for several desiderata (faithfulness, low rank, etc.). For now it only works on toy tasks.

arxiv.org/abs/2501.14926
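To make one desideratum concrete, here is a hedged sketch using SVD for the low-rank split; the paper learns its components by optimization rather than SVD, so this only illustrates the faithfulness constraint (components must sum back to the weights):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 6))   # a toy weight matrix

# Faithful low-rank decomposition: rank-1 components whose sum is W.
U, s, Vt = np.linalg.svd(W)
components = [s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s))]

W_recon = sum(components)         # faithfulness check: equals W
```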

*Scalable-Softmax Is Superior for Attention*
by Ken Nakanishi

Scaling the softmax inputs by the sequence length may improve generalization to longer sequences. No idea if it's novel, but it's so cool to see a single-author paper on new components - reminds me of 2014!

arxiv.org/abs/2501.19399
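The proposed change is tiny; a sketch (the scaling parameter s is learnable in the paper, fixed here):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ssmax(z, s=1.0):
    """Scalable-Softmax: scale the logits by s * log(n) before the
    softmax, so the distribution does not flatten as n grows."""
    n = len(z)
    return softmax(s * np.log(n) * z)

z = np.array([3.0, 1.0, 0.0, 0.0])
p_std = softmax(z)
p_ss = ssmax(z)   # sharper than p_std once log(n) > 1
```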

*Language Models Use Trigonometry to Do Addition*
by <a href="/thesubhashk/">Subhash Kantamneni</a> <a href="/tegmark/">Max Tegmark</a>

Really interesting paper that retro-engineers several LLMs to understand how single-token addition is performed in the internal embeddings.

arxiv.org/abs/2502.00873
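A toy version of the "clock"-style mechanism the paper describes — numbers as angles, addition as rotation (the modulus and single frequency here are made up; the LLMs studied use a multi-frequency helix):

```python
import numpy as np

m = 113  # hypothetical modulus, just for the toy

def embed(a):
    # A number as a point on a circle (one frequency of the "helix").
    t = 2 * np.pi * a / m
    return np.array([np.cos(t), np.sin(t)])

def add_by_rotation(a, b):
    """'Clock' addition: rotating a's angle by b's angle lands exactly
    on the embedding of (a + b) mod m; decode by nearest embedding."""
    ca, sa = embed(a)
    cb, sb = embed(b)
    rotated = np.array([ca * cb - sa * sb, sa * cb + ca * sb])
    sims = [rotated @ embed(c) for c in range(m)]
    return int(np.argmax(sims))

result = add_by_rotation(27, 46)  # → 73
```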

*Universal Sparse Autoencoders*
by <a href="/HThasarathan/">Harry Thasarathan</a> <a href="/Napoolar/">Thomas Fel</a> <a href="/MatthewKowal9/">Matthew Kowal</a> <a href="/CSProfKGD/">Kosta Derpanis</a> 

They train a shared SAE latent space on several vision encoders at once, showing, e.g., how the same concept activates in different models.

arxiv.org/abs/2502.03714
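A hedged sketch of the shared-latent idea with random weights and hard top-k sparsity (the actual USAE is trained end to end; all dims here are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
d_a, d_b, d_shared = 12, 20, 64   # two hypothetical vision encoders

# Per-model encoders/decoders into one shared sparse dictionary.
enc = {"a": rng.standard_normal((d_shared, d_a)),
       "b": rng.standard_normal((d_shared, d_b))}
dec = {"a": rng.standard_normal((d_a, d_shared)),
       "b": rng.standard_normal((d_b, d_shared))}

def shared_code(x, model, k=8):
    """Encode a model-specific activation into the shared SAE space,
    keeping only the top-k latents (hard sparsity for the sketch).
    The same latent index should fire for the same concept in
    either model once trained."""
    z = np.maximum(enc[model] @ x, 0.0)     # ReLU pre-codes
    keep = np.argsort(z)[-k:]
    sparse = np.zeros_like(z)
    sparse[keep] = z[keep]
    return sparse

z_a = shared_code(rng.standard_normal(d_a), "a")
recon_a = dec["a"] @ z_a   # reconstruction back in model a's space
```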

*Recursive Inference Scaling*
by <a href="/ibomohsin/">Ibrahim Alabdulmohsin | إبراهيم العبدالمحسن</a> <a href="/XiaohuaZhai/">Xiaohua Zhai</a>

Recursively applying the first part of a model can be a strong compute-efficient baseline in many scenarios, when evaluating on a fixed compute budget.

arxiv.org/abs/2502.07503
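A minimal sketch of the recursion with toy residual blocks (the split point and counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Hypothetical tiny "model": four residual tanh blocks.
blocks = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(4)]

def block_fn(W, x):
    return x + np.tanh(W @ x)   # residual block

def recursive_forward(x, num_recursions=3, split=2):
    """Apply the first `split` blocks `num_recursions` times before
    running the rest: extra effective depth, zero extra parameters."""
    for _ in range(num_recursions):
        for W in blocks[:split]:
            x = block_fn(W, x)
    for W in blocks[split:]:
        x = block_fn(W, x)
    return x

x0 = rng.standard_normal(dim)
y = recursive_forward(x0)
```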

*Rethinking Early Stopping: Refine, Then Calibrate*
by <a href="/Eugene_Berta/">Eugène Berta</a> <a href="/LChoshen/">Leshem (Legend) Choshen 🤖🤗</a> <a href="/DHolzmueller/">David Holzmüller</a> <a href="/BachFrancis/">Francis Bach</a>

Doing early stopping on the "refinement loss" (the original loss minus its calibration component) is beneficial for both accuracy and calibration.

arxiv.org/abs/2501.19195
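A hedged numpy sketch of the decomposition, using the Brier score and bin-wise recalibration (the paper treats proper losses in general; binning is a simplification here):

```python
import numpy as np

def brier_decomposition(probs, labels, n_bins=10):
    """Split the Brier score into calibration + refinement, where
    refinement is the loss left after perfectly recalibrating each
    bin to its empirical frequency. Early-stop on refinement."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    total = np.mean((probs - labels) ** 2)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    recal = probs.copy()
    for b in np.unique(bins):
        recal[bins == b] = labels[bins == b].mean()
    refinement = np.mean((recal - labels) ** 2)
    return total - refinement, refinement

cal, ref = brier_decomposition([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

On this toy input every bin is already pure, so all remaining loss is calibration error and the refinement term is zero.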

*ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features*
by <a href="/alec_helbling/">Alec Helbling</a> <a href="/tunahansalih/">Tuna Meral</a> <a href="/Ben_Hoov/">Ben Hoover</a> <a href="/PINguAR/">Pinar Yanardag 🚀 CVPR2025</a> <a href="/PoloChau/">Duen Horng "Polo" Chau</a>

Creates saliency maps for diffusion ViTs by propagating concepts (e.g., car) and repurposing cross-attention layers.

arxiv.org/abs/2502.04320
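A rough sketch of the saliency computation — a concept embedding dotted against image tokens, softmaxed over the patch grid (dims invented; the real method builds this inside the DiT's attention layers):

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 64, 32   # hypothetical 8x8 token grid

patch_tokens = rng.standard_normal((num_patches, dim))  # image tokens
concept_token = rng.standard_normal(dim)                # e.g. "car"

def concept_saliency(tokens, concept):
    """Attention-style saliency: score each image token against the
    concept embedding, normalize into a map over the patch grid."""
    scores = tokens @ concept / np.sqrt(tokens.shape[1])
    e = np.exp(scores - scores.max())
    return (e / e.sum()).reshape(8, 8)

saliency = concept_saliency(patch_tokens, concept_token)
```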

*From superposition to sparse codes: interpretable representations in NNs*
by <a href="/klindt_david/">David Klindt</a> <a href="/ninamiolane/">Nina Miolane 🦋 @ninamiolane.bsky.social</a> <a href="/rpatrik96/">Patrik Reizinger</a> <a href="/charles0neill/">Charlie O'Neill</a> 

Nice overview on the linearity of NN representations and the use of sparse coding to recover interpretable activations.

arxiv.org/abs/2503.01824
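A self-contained sketch of sparse coding via ISTA, the classic way to recover a sparse code against a dictionary (the dictionary here is random rather than learned from activations):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_atoms = 16, 64
D = rng.standard_normal((d, n_atoms))
D /= np.linalg.norm(D, axis=0)    # unit-norm dictionary atoms

def ista(x, D, lam=0.1, n_steps=200):
    """Sparse coding by ISTA: gradient step on ||x - D z||^2 followed
    by soft-thresholding, yielding a sparse code z for signal x."""
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_steps):
        g = D.T @ (D @ z - x)
        z = z - g / L
        z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return z

# A signal built from 3 atoms should get back an (approximately)
# 3-sparse code.
true_z = np.zeros(n_atoms)
true_z[[3, 17, 42]] = [1.5, -2.0, 1.0]
x_sig = D @ true_z
z_hat = ista(x_sig, D)
```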

*Differentiable Logic Cellular Automata*
by <a href="/PietroMiotti/">Pietro Miotti</a> <a href="/eyvindn/">Eyvind Niklasson</a> <a href="/RandazzoEttore/">Ettore Randazzo</a> <a href="/zzznah/">Alex Mordvintsev</a>

Combines differentiable cellular automata and logic circuits to learn recurrent circuits exhibiting complex (learned) behavior.

google-research.github.io/self-organisin…
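A hedged sketch of the differentiable-logic ingredient: soft gates that are exact on {0,1} corners plus a softmax-weighted gate choice per wire (the cellular-automaton wiring on a grid is omitted):

```python
import numpy as np

def soft_gates(a, b):
    """Real-valued relaxations of logic gates on inputs in [0, 1]:
    exact on the {0,1} corners, differentiable in between."""
    return {
        "and": a * b,
        "or": a + b - a * b,
        "xor": a + b - 2 * a * b,
        "not_a": 1 - a,
    }

def gate_layer(a, b, logits):
    """Differentiable gate selection: a softmax over the gate set,
    with logits ordered as [and, or, xor, not_a]; training pushes
    each wire toward one hard gate."""
    g = soft_gates(a, b)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return sum(wi * gi for wi, gi in zip(w, g.values()))

out = soft_gates(1.0, 0.0)
```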

*Generalized Interpolating Discrete Diffusion*
by <a href="/dvruette/">Dimitri von Rütte</a> <a href="/orvieto_antonio/">Antonio Orvieto</a> &amp; al.

A class of discrete diffusion models combining standard masking with uniform noise to allow the model to potentially "correct" previously wrong tokens.

arxiv.org/abs/2503.04482
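The forward corruption can be sketched as follows (probabilities and vocab size are placeholders; the paper defines the exact interpolation schedule):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50
MASK = VOCAB   # one extra token id reserved for the mask

def corrupt(tokens, t, p_uniform=0.2):
    """Forward process sketch: with probability t a token is noised;
    noised tokens become MASK, except a p_uniform fraction that are
    resampled uniformly — so the reverse model must also learn to
    correct tokens that look valid but are wrong."""
    tokens = np.asarray(tokens)
    noised = rng.random(tokens.shape) < t
    out = tokens.copy()
    out[noised] = MASK
    re_sample = noised & (rng.random(tokens.shape) < p_uniform)
    out[re_sample] = rng.integers(0, VOCAB, tokens.shape)[re_sample]
    return out

x = rng.integers(0, VOCAB, 100)
x_noisy = corrupt(x, t=0.5)
```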

Twitter friends, here's some draft notes for my upcoming course on automatic differentiation, mostly based on the "Elements of differentiable programming" book. Let me know what you think! They also include a notebook on operator overloading. 🙃

notion.so/sscardapane/Au…
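For the operator-overloading part, the standard minimal example is forward-mode AD with dual numbers:

```python
class Dual:
    """Forward-mode AD by operator overloading: carry (value, deriv)
    through arithmetic, so f(Dual(x, 1)).d equals f'(x)."""

    def __init__(self, v, d=0.0):
        self.v, self.d = v, d

    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.v + o.v, self.d + o.d)
    __radd__ = __add__

    def __mul__(self, o):
        # Product rule lives here.
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.v * o.v, self.d * o.v + self.v * o.d)
    __rmul__ = __mul__

def f(x):
    return 3 * x * x + 2 * x + 1   # f'(x) = 6x + 2

y = f(Dual(2.0, 1.0))              # y.v = 17, y.d = 14
```

The point of the overloading trick is that `f` is ordinary Python: it never knows it is being differentiated.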

*NoProp: Training Neural Networks without Backpropagation or Forward-propagation*
by <a href="/yeewhye/">Yee Whye Teh</a> et al.

They use a neural network to define a denoising process over the class labels, which allows them to train the blocks independently (i.e., "no backprop").

arxiv.org/abs/2503.24322
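A very rough structural sketch with untrained linear blocks, just to show the shape of inference (everything here is invented except the block-independence idea):

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z, T = 8, 4, 3   # input dim, label-embedding dim, num blocks

# One independent "denoising" block per step; training would fit each
# block alone to map (input, noisy label embedding) -> clean label
# embedding, with no gradients flowing between blocks.
blocks = [rng.standard_normal((d_z, d_x + d_z)) * 0.1 for _ in range(T)]

def infer(x):
    """Inference: start from pure noise over the label embedding and
    let each block denoise it in turn — hence no end-to-end backprop
    is ever needed."""
    z = rng.standard_normal(d_z)
    for W in blocks:
        z = W @ np.concatenate([x, z])
    return z

z_hat = infer(rng.standard_normal(d_x))
```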