Samira Abnar (@samira_abnar)'s Twitter Profile
Samira Abnar

@samira_abnar

Apple ML research

ID: 1798519229635014656

Joined: 06-06-2024 00:56:49

40 Tweets

224 Followers

182 Following

Enrico Fini (@donkeyshot21)'s Twitter Profile Photo

We release AIMv2, the second iteration of the AIM family of large autoregressive vision encoders. This time we bring multimodality into the game 🔥

Paper: arxiv.org/abs/2411.14402
Repo: github.com/apple/ml-aim
Model Gallery: huggingface.co/collections/ap…
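
As a rough usage sketch (not taken from the post above): the checkpoints in the linked gallery can reportedly be loaded through Hugging Face `transformers`. The model id and the output field below are assumptions from memory of the model cards, so check the gallery for the exact names.

```python
# Hedged sketch: loading an AIMv2 checkpoint from the Hugging Face gallery linked above.
# Model id and trust_remote_code requirement are assumptions -- verify on the model card.
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

model_id = "apple/aimv2-large-patch14-224"  # assumed id; see the gallery for exact names
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg")                      # any local RGB image
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)                              # patch-level visual features
print(outputs.last_hidden_state.shape)                 # output field name may vary by model class
```
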
Anastasiia Filippova🇺🇦 (@nasfilippova)'s Twitter Profile Photo

Thrilled to share that our work No Need to Talk: Asynchronous Mixture of Language Models [arxiv.org/abs/2410.03529] has been accepted to #ICLR2025! In this paper, we explore strategies to mitigate the communication cost of large language models, both at training and inference…

Vimal Thilak🦉🐒 (@aggieinca)'s Twitter Profile Photo

Mixture of experts is an interesting architecture, or so Samira Abnar told me when I joined the project last year. After some brilliant work from Harshay Shah and Samira, we have a scaling law paper!

Mostafa Dehghani (@m__dehghani)'s Twitter Profile Photo

This is neat! Finding the sweet spot between params & FLOPs by tweaking sparsity (FLOP-free params). Exploring parameter-free FLOPs, like attention FLOPs (which depend on seq len) or parameter reuse, could be an exciting angle for this trade-off too. x.com/samira_abnar/s…
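
For intuition on why attention FLOPs are a "parameter-free" knob, here is back-of-envelope arithmetic using the standard approximations from the scaling-law literature (roughly 2 FLOPs per active parameter plus about 2 · n_layers · seq_len · d_model per token for attention). It is illustrative only, not the accounting used in the paper being discussed.

```python
def forward_flops_per_token(n_active_params, n_layers, d_model, seq_len):
    """Rough forward-pass FLOPs per token (standard order-of-magnitude estimate).

    - Parameter term: ~2 FLOPs per active parameter (one multiply-add each).
    - Attention term: ~2 * n_layers * seq_len * d_model, compute that grows
      with context length rather than with the number of weights.
    """
    return 2 * n_active_params + 2 * n_layers * seq_len * d_model

# Example: ~1B active parameters, 24 layers, d_model = 2048 (made-up sizes).
for ctx in (2_048, 32_768):
    total = forward_flops_per_token(1e9, 24, 2048, ctx)
    print(f"seq_len={ctx:>6}: attention share of FLOPs ~ {1 - 2e9 / total:.1%}")
```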

Miguel Angel Bautista (@itsbautistam)'s Twitter Profile Photo

Thuerey Group at TUM Nice work and cool results! FWIW you can train a diffusion model directly in data space (and with general geometries) without needing to compress the data into a latent space: arxiv.org/abs/2305.15586. Perhaps Fig. 17 is the closest to your experimental setting (cc Ahmed Elhag)

Dan Busbridge (@danbusbridge)'s Twitter Profile Photo

Particularly in the last few months, as test-time compute scaling is taking off, understanding parameter and FLOP trade-offs has become increasingly important. This work, led by Samira Abnar with fantastic contributions from Harshay Shah, provides a deep dive into this topic.

Harshay Shah (@harshays_)'s Twitter Profile Photo

MoEs provide two knobs for scaling: model size (total params) + FLOPs-per-token (via active params). What’s the right scaling strategy? And how does it depend on the pretraining budget? Our work introduces sparsity-aware scaling laws for MoE LMs to tackle these questions! 🧵👇
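
To make the two knobs concrete, here is a toy parameter count for an MoE feed-forward stack with top-k routing. The dimensions are made up, only the expert FFN weights are counted, and this is not the paper's parameterization.

```python
def moe_param_counts(d_model, d_ff, n_layers, n_experts, top_k):
    """Toy accounting for a Transformer whose FFN layers are MoEs.

    Counts only the expert FFN weights (2 * d_model * d_ff per expert);
    attention, embeddings, and router parameters are ignored.
    """
    per_expert = 2 * d_model * d_ff
    total_ffn = n_layers * n_experts * per_expert   # knob 1: model size (total params)
    active_ffn = n_layers * top_k * per_expert      # knob 2: FLOPs-per-token (active params)
    sparsity = 1 - top_k / n_experts                # fraction of experts inactive per token
    return total_ffn, active_ffn, sparsity

total, active, s = moe_param_counts(d_model=2048, d_ff=8192, n_layers=24,
                                    n_experts=64, top_k=2)
print(f"total FFN params ~ {total / 1e9:.1f}B, active ~ {active / 1e9:.2f}B, sparsity = {s:.2%}")
```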

Jonas Geiping (@jonasgeiping)'s Twitter Profile Photo

Ok, so I can finally talk about this! 

We spent the last year (actually a bit longer) training an LLM with recurrent depth at scale.

The model has an internal latent space in which it can adaptively spend more compute to think longer. 

I think the tech report ...🐦‍⬛
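
As a toy illustration of the general idea only (a weight-tied block unrolled a variable number of times, so extra test-time compute buys more "thinking" without more parameters): the sketch below assumes PyTorch and is not the architecture from the tech report.

```python
import torch
import torch.nn as nn

class TinyRecurrentDepth(nn.Module):
    """Toy weight-tied recurrent-depth model: one block applied n_iters times,
    so compute scales with the iteration count while the parameter count stays fixed.
    Not the architecture from the tech report above.
    """
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, x, n_iters=4):
        h = x
        for _ in range(n_iters):   # same weights reused at every "depth" step
            h = self.block(h)
        return h

x = torch.randn(2, 16, 256)        # (batch, seq, d_model) latent inputs
model = TinyRecurrentDepth()
fast = model(x, n_iters=2)         # spend little compute
slow = model(x, n_iters=16)        # "think longer" with the same parameters
```
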
Aran Komatsuzaki (@arankomatsuzaki)'s Twitter Profile Photo

Apple presents:

Distillation Scaling Laws

Presents a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher
Dan Busbridge (@danbusbridge)'s Twitter Profile Photo

Reading "Distilling the Knowledge in a Neural Network" left me fascinated and wondering: "If I want a small, capable model, should I distill from a more powerful model, or train from scratch?" Our distillation scaling law shows, well, it's complicated... 🧵 arxiv.org/abs/2502.08606
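
For a sense of the compute split the law reasons about, here is back-of-envelope accounting using the common approximations (training ≈ 6·N·D FLOPs, a forward pass ≈ 2·N·D). The numbers are made up, and the actual fitted law is in arxiv.org/abs/2502.08606.

```python
def distillation_compute(n_student, n_teacher, d_tokens, teacher_pretrained=True):
    """Rough FLOPs for distilling a student on d_tokens of data.

    Uses the usual approximations (training ~6*N*D, inference ~2*N*D);
    illustrative only -- the fitted scaling law lives in the paper.
    """
    student_train = 6 * n_student * d_tokens              # student forward + backward
    teacher_infer = 2 * n_teacher * d_tokens              # teacher forward passes for targets
    teacher_train = 0 if teacher_pretrained else 6 * n_teacher * d_tokens
    return student_train + teacher_infer + teacher_train

distill = distillation_compute(n_student=1e9, n_teacher=7e9, d_tokens=100e9)
scratch = 6 * 1e9 * 100e9                                  # same student trained from scratch
print(f"distillation ~ {distill:.2e} FLOPs vs from scratch ~ {scratch:.2e} FLOPs")
```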

Eeshan Gunesh Dhekane (@eeshandhekane)'s Twitter Profile Photo

Parameterized Transforms 🚀

Here is a new tool that provides a modular and extensible implementation of torchvision-based image augmentations and exposes access to their parameterization. [1/5]
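
The repo defines its own API; purely to illustrate the underlying idea (exposing the parameters an augmentation samples so they can be logged or re-applied), here is a sketch using torchvision's stock get_params/functional interface. This is not the Parameterized Transforms tool itself.

```python
import torch
from torchvision import transforms
from torchvision.transforms import functional as F

# Illustration of the idea only -- NOT the Parameterized Transforms API.
# A stock RandomResizedCrop hides the crop it sampled; get_params plus the
# functional API makes that parameterization explicit and reusable.
img = torch.rand(3, 224, 224)

top, left, height, width = transforms.RandomResizedCrop.get_params(
    img, scale=[0.08, 1.0], ratio=[3 / 4, 4 / 3]
)

out = F.resized_crop(img, top, left, height, width, size=[96, 96])
replayed = F.resized_crop(img, top, left, height, width, size=[96, 96])
print(torch.equal(out, replayed))  # True: the sampled parameters were reused exactly
```
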
Yuyang Wang (@yuyangw95)'s Twitter Profile Photo

We’re looking for an intern at Apple MLR 🍎 ASAP. Join us if you’re interested in building universal diffusion/flow-matching models at scale!

Pau Rodríguez (@prlz77)'s Twitter Profile Photo

Our work on fine-grained control of LLMs and diffusion models via Activation Transport will be presented at ICLR 2025 as a spotlight ✨ Check out our new blog post: machinelearning.apple.com/research/trans…

Mustafa Shukor (@mustafashukor1)'s Twitter Profile Photo

We release a large scale study to answer the following:
- Is late fusion inherently better than early fusion for multimodal models?
- How do native multimodal models scale compared to LLMs?
- Can sparsity (MoEs) play a detrimental role in handling heterogeneous modalities? 🧵
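
A toy contrast of the two designs under study, assuming PyTorch (these are not the paper's models): early fusion feeds raw image patch tokens and text tokens into one shared model from layer 0, while late fusion runs a separate vision encoder first and only then joins the text stream.

```python
import torch
import torch.nn as nn

d = 256
text_tok = torch.randn(2, 32, d)   # text token embeddings
img_tok = torch.randn(2, 64, d)    # image patch embeddings

shared = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, 4, batch_first=True), num_layers=4
)  # the shared "LLM" trunk

# Early fusion: one model sees image patches and text tokens from the first layer.
early_out = shared(torch.cat([img_tok, text_tok], dim=1))

# Late fusion: a dedicated vision encoder first, whose outputs then join the text.
vision_enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, 4, batch_first=True), num_layers=2
)
late_out = shared(torch.cat([vision_enc(img_tok), text_tok], dim=1))
```
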
Enrico Fini (@donkeyshot21)'s Twitter Profile Photo

Training and scaling large multimodal models from scratch? This is the thread for you. In this new paper, we provide an extensive study with hundreds of runs, fitting scaling laws for early/late fusion models and MoEs, and exploring different data mixtures. Tons of cool findings.

Vimal Thilak🦉🐒 (@aggieinca)'s Twitter Profile Photo

Check out this post for information about research from Apple that will be presented at ICLR 2025 in 🇸🇬 this week. I will be at ICLR and will be presenting some of our work (led by Samira Abnar) at the Sparsity in LLMs (SLLM) Workshop at ICLR 2025. Happy to chat about JEPAs as well!

Awni Hannun (@awnihannun)'s Twitter Profile Photo

We have two awesome new videos on MLX at #WWDC25 this year.

- Learn all about MLX. 
- Learn all about running LLMs locally with MLX.

Angelos Katharopoulos, Shashank Prasanna, myself, and others worked super hard to make these. Check them out and hope you find them useful!
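
A minimal local-LLM sketch with the mlx-lm package that the second video covers: the model id below is a placeholder from the mlx-community Hugging Face org, and the exact generate() keyword arguments may differ between mlx-lm versions.

```python
# Hedged sketch, assuming Apple silicon and `pip install mlx-lm`.
# Model id is a placeholder; generate()'s keyword arguments can vary by version.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")  # placeholder id
text = generate(model, tokenizer, prompt="Explain MLX in one sentence.", max_tokens=64)
print(text)
```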