Antonio Orvieto (@orvieto_antonio)'s Twitter Profile
Antonio Orvieto

@orvieto_antonio

Deep Learning PI @ELLISInst_Tue, Group Leader @MPI_IS.
I compute stuff with lots of gradients 🧮,
I like Kierkegaard & Lévi-Strauss 🧙‍♂️

ID: 1172891108076077057

Website: http://orvi.altervista.org/ · Joined: 14-09-2019 15:13:38

303 Tweets

1.1K Followers

1.1K Following

Damien Ferbach (@damien_ferbach)'s Twitter Profile Photo

It's very difficult to improve the *exponent* in scaling laws for loss vs compute, especially by changing the optimizer!
Our new paper shows that scaling momentum correctly can *provably* improve the scaling exponent on a theoretical model. Empirically, it works on LSTMs too!
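The tweet does not spell out the scaling rule. Below is a minimal sketch of the general idea, assuming the common choice of tying the momentum timescale to the training horizon T via β = 1 − c/T; this schedule and the constant c are illustrative assumptions, not necessarily the paper's rule.

```python
# Hypothetical sketch: scale the momentum coefficient with the training
# horizon T, so the averaging timescale 1/(1 - beta) grows with compute.
# beta = 1 - c/T is an assumed schedule, not the paper's exact rule.
import torch

def make_optimizer(params, total_steps: int, lr: float = 1e-3, c: float = 10.0):
    # Momentum timescale ~ T/c steps: longer runs average more gradients.
    beta = 1.0 - c / total_steps
    return torch.optim.SGD(params, lr=lr, momentum=beta)

model = torch.nn.Linear(32, 1)
opt = make_optimizer(model.parameters(), total_steps=10_000)

x, y = torch.randn(64, 32), torch.randn(64, 1)
for _ in range(100):  # shortened loop for illustration
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
```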
Antonio Orvieto (@orvieto_antonio)'s Twitter Profile Photo

join us tonight to talk about Adam! maybe we will touch a bit on Muon & friends -- they carry many of the open questions we have about Adam ❤️ thanks Yannic

Jonas Geiping (@jonasgeiping)'s Twitter Profile Photo

Forecasting future events is a fascinating task for language models. Arguably the hardest application for a pure "oracle" that can't take actions; requiring reasoning about conflicting info, planning, information seeking... But, forecasting is also uniquely hard to evaluate:

Antonio Orvieto (@orvieto_antonio)'s Twitter Profile Photo

It is essential to thoroughly evaluate, test, and compare ideas, yet this unbiased process is rare in modern research. Niccolò did this for checkpoint averaging: with a large number of experiments, he demonstrates when and where to average weights in modern large-scale setups. Super
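For readers unfamiliar with the technique itself, here is a minimal sketch of checkpoint averaging: parameters from several saved checkpoints of one run are averaged uniformly. The file names and uniform weighting are illustrative assumptions; the cited study's contribution is the empirical question of when and where such averaging helps, which this sketch does not answer.

```python
# Minimal checkpoint averaging: average state dicts uniformly.
import torch

def average_checkpoints(paths):
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# Usage (hypothetical file names):
# model.load_state_dict(average_checkpoints(["ckpt_8k.pt", "ckpt_9k.pt", "ckpt_10k.pt"]))
```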

Robert Lange (@roberttlange)'s Twitter Profile Photo

Text-to-LoRA: What if you no longer had to fine-tune your LLM for every single downstream task?

🚀 Stoked to share our work on instant LLM adaptation using meta-learned hypernetworks 📝 → 🔥

The idea is simple yet elegant: We text-condition a hypernetwork to output LoRA
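A minimal sketch of the idea as stated in the thread, in a toy setup where a hypernetwork maps a task-description embedding to LoRA factors (A, B) for a single linear layer. All dimensions, the single-layer scope, and the MLP hypernetwork are illustrative assumptions, not the paper's architecture.

```python
# Toy text-conditioned hypernetwork emitting LoRA factors for one layer.
import torch
import torch.nn as nn

class TextToLoRA(nn.Module):
    def __init__(self, text_dim=384, hidden=256, d_model=768, rank=8):
        super().__init__()
        self.rank, self.d_model = rank, d_model
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * rank * d_model),  # flattened A and B
        )

    def forward(self, task_emb):
        flat = self.net(task_emb)
        A, B = flat.split(self.rank * self.d_model, dim=-1)
        return (A.view(-1, self.rank, self.d_model),
                B.view(-1, self.d_model, self.rank))

hyper = TextToLoRA()
task_emb = torch.randn(1, 384)      # stand-in for a task-description embedding
A, B = hyper(task_emb)              # per-task LoRA factors
W = torch.randn(768, 768)           # frozen base weight
W_adapted = W + B[0] @ A[0]         # LoRA update: W + B A
```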
Simone Scardapane (@s_scardapane)'s Twitter Profile Photo

*Generalized Interpolating Discrete Diffusion*
by Dimitri von Rütte, Antonio Orvieto et al.

A class of discrete diffusion models combining standard masking with uniform noise to allow the model to potentially "correct" previously wrong tokens.

arxiv.org/abs/2503.04482
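A minimal sketch of the forward corruption the abstract describes: each token is kept, masked, or resampled uniformly at random. The specific schedule below (corruption probability t, fixed uniform-noise fraction u) is an illustrative assumption, not the paper's exact interpolation.

```python
# Toy forward process mixing mask noise with uniform token noise.
import torch

def corrupt(tokens, t, vocab_size, mask_id, u=0.2):
    # With prob t a token is corrupted; a corrupted token becomes [MASK]
    # with prob (1 - u), or a uniformly random token with prob u.
    corrupted = torch.rand_like(tokens, dtype=torch.float) < t
    use_uniform = torch.rand_like(tokens, dtype=torch.float) < u
    random_tok = torch.randint(0, vocab_size, tokens.shape)
    noisy = torch.where(use_uniform, random_tok,
                        torch.full_like(tokens, mask_id))
    return torch.where(corrupted, noisy, tokens)

x = torch.randint(0, 100, (1, 12))                     # toy sequence, vocab 100
x_t = corrupt(x, t=0.5, vocab_size=100, mask_id=100)   # id 100 = [MASK]
```

Because some corrupted positions hold real (but wrong) tokens rather than [MASK], the reverse model can learn to revise them, which pure masking diffusion cannot.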