Samira Abnar (@samira_abnar)'s Twitter Profile
Samira Abnar

@samira_abnar

Apple ML research

ID: 1798519229635014656

Joined: 06-06-2024 00:56:49

40 Tweets

224 Followers

182 Following

Enrico Fini (@donkeyshot21)'s Twitter Profile Photo

We release AIMv2, the second iteration of the AIM family of large autoregressive vision encoders. This time we bring multimodality into the game 🔥

Paper: arxiv.org/abs/2411.14402
Repo: github.com/apple/ml-aim
Model Gallery: huggingface.co/collections/ap…
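
As a rough usage sketch (not taken from the post above): the checkpoints in the linked gallery can reportedly be loaded through Hugging Face `transformers`. The model id and the output field below are assumptions from memory of the model cards, so check the gallery for the exact names.

```python
# Hedged sketch: loading an AIMv2 checkpoint from the Hugging Face gallery linked above.
# Model id and trust_remote_code requirement are assumptions -- verify on the model card.
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

model_id = "apple/aimv2-large-patch14-224"  # assumed id; see the gallery for exact names
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg")                      # any local RGB image
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)                              # patch-level visual features
print(outputs.last_hidden_state.shape)                 # output field name may vary by model class
```
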
Anastasiia Filippova🇺🇦 (@nasfilippova)'s Twitter Profile Photo

Thrilled to share that our work No Need to Talk: Asynchronous Mixture of Language Models [arxiv.org/abs/2410.03529] has been accepted to #ICLR2025! In this paper, we explore strategies to mitigate the communication cost of large language models, both at training and inference…

Vimal Thilak🦉🐒 (@aggieinca)'s Twitter Profile Photo

Mixture of experts is an interesting architecture, or so Samira Abnar told me when I joined the project last year. After some brilliant work from Harshay Shah and Samira, we have a scaling law paper!

Mostafa Dehghani (@m__dehghani)'s Twitter Profile Photo

This is neat! Finding the sweet spot between params & FLOPs by tweaking sparsity (FLOP-free params). Exploring parameter-free FLOPs, like attention FLOPs (which depend on seq len) or parameter reuse, could be an exciting angle for this trade-off too. x.com/samira_abnar/s…
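
For intuition on why attention FLOPs are a "parameter-free" knob, here is back-of-envelope arithmetic using the standard approximations from the scaling-law literature (roughly 2 FLOPs per active parameter plus about 2 · n_layers · seq_len · d_model per token for attention). It is illustrative only, not the accounting used in the paper being discussed.

```python
def forward_flops_per_token(n_active_params, n_layers, d_model, seq_len):
    """Rough forward-pass FLOPs per token (standard order-of-magnitude estimate).

    - Parameter term: ~2 FLOPs per active parameter (one multiply-add each).
    - Attention term: ~2 * n_layers * seq_len * d_model, compute that grows
      with context length rather than with the number of weights.
    """
    return 2 * n_active_params + 2 * n_layers * seq_len * d_model

# Example: ~1B active parameters, 24 layers, d_model = 2048 (made-up sizes).
for ctx in (2_048, 32_768):
    total = forward_flops_per_token(1e9, 24, 2048, ctx)
    print(f"seq_len={ctx:>6}: attention share of FLOPs ~ {1 - 2e9 / total:.1%}")
```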

Miguel Angel Bautista (@itsbautistam)'s Twitter Profile Photo

Thuerey Group at TUM Nice work and cool results! FWIW you can train a diffusion model directly in data space (and with general geometries) without needing to compress the data into a latent space: arxiv.org/abs/2305.15586. Perhaps Fig. 17 is the closest to your experimental setting (cc Ahmed Elhag)

Dan Busbridge (@danbusbridge)'s Twitter Profile Photo

Particularly in the last few months, as test-time compute scaling is taking off, understanding parameter and FLOP trade-offs has become increasingly important. This work, led by Samira Abnar with fantastic contributions from Harshay Shah, provides a deep dive into this topic.

Harshay Shah (@harshays_)'s Twitter Profile Photo

MoEs provide two knobs for scaling: model size (total params) + FLOPs-per-token (via active params). What’s the right scaling strategy? And how does it depend on the pretraining budget? Our work introduces sparsity-aware scaling laws for MoE LMs to tackle these questions! 🧵👇
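
To make the two knobs concrete, here is a toy parameter count for an MoE feed-forward stack with top-k routing. The dimensions are made up, only the expert FFN weights are counted, and this is not the paper's parameterization.

```python
def moe_param_counts(d_model, d_ff, n_layers, n_experts, top_k):
    """Toy accounting for a Transformer whose FFN layers are MoEs.

    Counts only the expert FFN weights (2 * d_model * d_ff per expert);
    attention, embeddings, and router parameters are ignored.
    """
    per_expert = 2 * d_model * d_ff
    total_ffn = n_layers * n_experts * per_expert   # knob 1: model size (total params)
    active_ffn = n_layers * top_k * per_expert      # knob 2: FLOPs-per-token (active params)
    sparsity = 1 - top_k / n_experts                # fraction of experts inactive per token
    return total_ffn, active_ffn, sparsity

total, active, s = moe_param_counts(d_model=2048, d_ff=8192, n_layers=24,
                                    n_experts=64, top_k=2)
print(f"total FFN params ~ {total / 1e9:.1f}B, active ~ {active / 1e9:.2f}B, sparsity = {s:.2%}")
```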

Jonas Geiping (@jonasgeiping)'s Twitter Profile Photo

Ok, so I can finally talk about this! 

We spent the last year (actually a bit longer) training an LLM with recurrent depth at scale.

The model has an internal latent space in which it can adaptively spend more compute to think longer. 

I think the tech report ...🐦‍⬛
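
As a toy illustration of the general idea only (a weight-tied block unrolled a variable number of times, so extra test-time compute buys more "thinking" without more parameters): the sketch below assumes PyTorch and is not the architecture from the tech report.

```python
import torch
import torch.nn as nn

class TinyRecurrentDepth(nn.Module):
    """Toy weight-tied recurrent-depth model: one block applied n_iters times,
    so compute scales with the iteration count while the parameter count stays fixed.
    Not the architecture from the tech report above.
    """
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, x, n_iters=4):
        h = x
        for _ in range(n_iters):   # same weights reused at every "depth" step
            h = self.block(h)
        return h

x = torch.randn(2, 16, 256)        # (batch, seq, d_model) latent inputs
model = TinyRecurrentDepth()
fast = model(x, n_iters=2)         # spend little compute
slow = model(x, n_iters=16)        # "think longer" with the same parameters
```
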
Aran Komatsuzaki (@arankomatsuzaki)'s Twitter Profile Photo

Apple presents:

Distillation Scaling Laws

Presents a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher
Dan Busbridge (@danbusbridge)'s Twitter Profile Photo

Reading "Distilling the Knowledge in a Neural Network" left me fascinated and wondering: "If I want a small, capable model, should I distill from a more powerful model, or train from scratch?" Our distillation scaling law shows, well, it's complicated... 🧵 arxiv.org/abs/2502.08606
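
For a sense of the compute split the law reasons about, here is back-of-envelope accounting using the common approximations (training ≈ 6·N·D FLOPs, a forward pass ≈ 2·N·D). The numbers are made up, and the actual fitted law is in arxiv.org/abs/2502.08606.

```python
def distillation_compute(n_student, n_teacher, d_tokens, teacher_pretrained=True):
    """Rough FLOPs for distilling a student on d_tokens of data.

    Uses the usual approximations (training ~6*N*D, inference ~2*N*D);
    illustrative only -- the fitted scaling law lives in the paper.
    """
    student_train = 6 * n_student * d_tokens              # student forward + backward
    teacher_infer = 2 * n_teacher * d_tokens              # teacher forward passes for targets
    teacher_train = 0 if teacher_pretrained else 6 * n_teacher * d_tokens
    return student_train + teacher_infer + teacher_train

distill = distillation_compute(n_student=1e9, n_teacher=7e9, d_tokens=100e9)
scratch = 6 * 1e9 * 100e9                                  # same student trained from scratch
print(f"distillation ~ {distill:.2e} FLOPs vs from scratch ~ {scratch:.2e} FLOPs")
```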

Eeshan Gunesh Dhekane (@eeshandhekane)'s Twitter Profile Photo

Parameterized Transforms 🚀

Here is a new tool that provides a modular and extensible implementation of torchvision-based image augmentations and exposes access to their parameterization. [1/5]
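
The repo defines its own API; purely to illustrate the underlying idea (exposing the parameters an augmentation samples so they can be logged or re-applied), here is a sketch using torchvision's stock get_params/functional interface. This is not the Parameterized Transforms tool itself.

```python
import torch
from torchvision import transforms
from torchvision.transforms import functional as F

# Illustration of the idea only -- NOT the Parameterized Transforms API.
# A stock RandomResizedCrop hides the crop it sampled; get_params plus the
# functional API makes that parameterization explicit and reusable.
img = torch.rand(3, 224, 224)

top, left, height, width = transforms.RandomResizedCrop.get_params(
    img, scale=[0.08, 1.0], ratio=[3 / 4, 4 / 3]
)

out = F.resized_crop(img, top, left, height, width, size=[96, 96])
replayed = F.resized_crop(img, top, left, height, width, size=[96, 96])
print(torch.equal(out, replayed))  # True: the sampled parameters were reused exactly
```
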
Yuyang Wang (@yuyangw95)'s Twitter Profile Photo

We’re looking for an intern at Apple MLR 🍎 ASAP. Join us if you’re interested in building universal diffusion/flow-matching models at scale!

Pau Rodríguez (@prlz77)'s Twitter Profile Photo

Our work on fine-grained control of LLMs and diffusion models via Activation Transport will be presented at ICLR 2025 as a spotlight ✨ Check out our new blog post: machinelearning.apple.com/research/trans…

Mustafa Shukor (@mustafashukor1)'s Twitter Profile Photo

We release a large scale study to answer the following:
- Is late fusion inherently better than early fusion for multimodal models?
- How do native multimodal models scale compared to LLMs?
- Can sparsity (MoEs) play a detrimental role in handling heterogeneous modalities? 🧵
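
A toy contrast of the two designs under study, assuming PyTorch (these are not the paper's models): early fusion feeds raw image patch tokens and text tokens into one shared model from layer 0, while late fusion runs a separate vision encoder first and only then joins the text stream.

```python
import torch
import torch.nn as nn

d = 256
text_tok = torch.randn(2, 32, d)   # text token embeddings
img_tok = torch.randn(2, 64, d)    # image patch embeddings

shared = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, 4, batch_first=True), num_layers=4
)  # the shared "LLM" trunk

# Early fusion: one model sees image patches and text tokens from the first layer.
early_out = shared(torch.cat([img_tok, text_tok], dim=1))

# Late fusion: a dedicated vision encoder first, whose outputs then join the text.
vision_enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, 4, batch_first=True), num_layers=2
)
late_out = shared(torch.cat([vision_enc(img_tok), text_tok], dim=1))
```
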
Enrico Fini (@donkeyshot21)'s Twitter Profile Photo

Training and scaling large multimodal models from scratch? This is the thread for you. In this new paper, we provide an extensive study with hundreds of runs, fitting scaling laws for early/late fusion models and MoEs, and exploring different data mixtures. Tons of cool findings.

Vimal Thilak🦉🐒 (@aggieinca)'s Twitter Profile Photo

Check out this post for information about research from Apple that will be presented at ICLR 2025 in 🇸🇬 this week. I will be at ICLR and will be presenting some of our work (led by Samira Abnar) at the Sparsity in LLMs (SLLM) Workshop at ICLR 2025. Happy to chat about JEPAs as well!

Awni Hannun (@awnihannun)'s Twitter Profile Photo

We have two awesome new videos on MLX at #WWDC25 this year.

- Learn all about MLX. 
- Learn all about running LLMs locally with MLX.

Angelos Katharopoulos, Shashank Prasanna, myself, and others worked super hard to make these. Check them out and hope you find them useful!
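
A minimal local-LLM sketch with the mlx-lm package that the second video covers: the model id below is a placeholder from the mlx-community Hugging Face org, and the exact generate() keyword arguments may differ between mlx-lm versions.

```python
# Hedged sketch, assuming Apple silicon and `pip install mlx-lm`.
# Model id is a placeholder; generate()'s keyword arguments can vary by version.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")  # placeholder id
text = generate(model, tokenizer, prompt="Explain MLX in one sentence.", max_tokens=64)
print(text)
```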