Anastasiia Filippova🇺🇦 (@nasfilippova)'s Twitter Profile
Anastasiia Filippova🇺🇦

@nasfilippova

Apple🍏, WorldQuant, EPFL, MIPT

ID: 1585699311299477504

Link: https://anasfil.io · Joined: 27-10-2022 18:26:10

140 Tweets

388 Followers

235 Following

Lewis Tunstall (@_lewtun)'s Twitter Profile Photo

We took a deep dive into the DeepSeek R1 tech report today at Hugging Face and recorded the discussion :) Let me know if you'd like us to publish our journal club more often! youtu.be/1xDVbu-WaFo

Anastasiia Filippova🇺🇦 (@nasfilippova)'s Twitter Profile Photo

Thrilled to share that our work No Need to Talk: Asynchronous Mixture of Language Models [arxiv.org/abs/2410.03529] has been accepted to #ICLR2025! In this paper, we explore strategies to mitigate the communication cost of large language models, both at training and inference.

Awni Hannun (@awnihannun)'s Twitter Profile Photo

DeepSeek R1 (the full 680B model) runs nicely in higher quality 4-bit on 3 M2 Ultras with MLX. Asked it a coding question and it thought for ~2k tokens and generated 3500 tokens overall:

Awni Hannun (@awnihannun)'s Twitter Profile Photo

The DeepSeek V3 model file is ~450 lines of code in MLX LM. Includes pipeline-parallelism and all. Good way to see how it all works.

Samira Abnar (@samira_abnar)'s Twitter Profile Photo

🚨 One question that has always intrigued me is the role of different ways to increase a model's capacity: parameters, parallelizable compute, or sequential compute? We explored this through the lens of MoEs:

Rylan Schaeffer (@rylanschaeffer)'s Twitter Profile Photo

I'm going to catch hell for posting, but to summarize:
1. This paper misled its way to an #ICLR2025 Oral
2. I pointed this out
3. The AC rejected the paper
4. The authors complained & somehow persuaded ICLR to overrule the AC and award a Spotlight
5. The AC made clear they were overruled