Yuval Kirstain (@ykirstain) 's Twitter Profile
Yuval Kirstain

@ykirstain

Research Scientist @Meta | Building GenAI capabilities

ID: 1205923801709588481

Joined: 14-12-2019 18:53:43

401 Tweets

618 Followers

633 Following

Omri Avrahami (@omriavr) 's Twitter Profile Photo

[1/10] 🚨 We present our recent Snap Inc. project: Stable Flow --- A training-free method that performs various types of image editing operations (e.g., non-rigid editing, object addition and replacement) using flow models. Project page: omriavrahami.com/stable-flow

Ziqi Huang (@ziqi_huang_) 's Twitter Profile Photo

🎥 𝗩𝗕𝗲𝗻𝗰𝗵 𝗔𝗿𝗲𝗻𝗮: 𝗪𝗮𝘁𝗰𝗵 𝗔𝗜-𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗲𝗱 𝗩𝗶𝗱𝗲𝗼𝘀 𝗜𝗻𝘀𝘁𝗮𝗻𝘁𝗹𝘆 🎥 ✅ 180,000+ AI-generated videos, 40+ models (and growing) ✅ You can optionally vote for your preferred outputs Try it here: huggingface.co/spaces/Vchitec…

Rohit Girdhar (@_rohitgirdhar_) 's Twitter Profile Photo

Super excited to share some recent work that shows that pure, text-only LLMs can see and hear without any training! Our approach, called "MILS", uses LLMs with off-the-shelf multimodal models to caption images/videos/audio, improve image generation, style transfer, and more!

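
The MILS recipe described above can be sketched as a simple propose-score-refine loop. Everything here is illustrative: `llm_propose` and `score` are hypothetical stand-ins for the text-only LLM and the off-the-shelf multimodal scorer (e.g. a CLIP similarity), not the paper's actual interfaces.

```python
def mils_loop(llm_propose, score, steps=3, k=5):
    """MILS-style training-free loop sketch: an LLM proposes candidate
    captions, a multimodal scorer ranks them, and the top candidates are
    fed back to the LLM as context for the next round."""
    best = []
    for _ in range(steps):
        candidates = llm_propose(best)           # LLM conditions on prior winners
        ranked = sorted(candidates, key=score, reverse=True)
        best = ranked[:k]                        # keep the top-k for feedback
    return best[0]
```

The same loop works for any modality for which a scorer exists, which is why one text-only LLM can "see and hear" without any gradient updates.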
Yinbo Chen (@yinbochen) 's Twitter Profile Photo

Introducing “Diffusion Autoencoders are Scalable Image Tokenizers” (DiTo). We show that with proper designs and scaling up, diffusion autoencoders (a single L2 loss) can outperform the GAN-LPIPS tokenizers (hybrid losses) used in current SOTA generative models. (1/4)

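
A minimal sketch of the single-loss idea, assuming hypothetical `encode` and `denoise` stand-ins (not DiTo's actual architecture): the decoder is a diffusion model conditioned on the encoder's tokens, and the only training signal is one L2 noise-prediction term, in contrast to GAN+LPIPS hybrid objectives.

```python
import numpy as np

rng = np.random.default_rng(0)

def dito_style_loss(encode, denoise, image, t):
    """Diffusion-autoencoder tokenizer loss sketch: one L2 term, no GAN,
    no LPIPS. The noising schedule here is an illustrative linear blend."""
    z = encode(image)                          # latent tokens from the encoder
    eps = rng.standard_normal(image.shape)
    noisy = (1 - t) * image + t * eps          # corrupt the image
    # The decoder predicts the noise, conditioned on the tokens z.
    return np.mean((denoise(noisy, z, t) - eps) ** 2)
```
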
Ishan Misra (@imisra_) 's Twitter Profile Photo

Tokenizers in image/video generation are way understudied! The "standard" recipe: a combinatorial search over different losses using a plethora of models. DiTo is our attempt to break away from this: simpler, scalable, and theoretically sound! Idea: use diffusion to learn the tokens.

Yuval Kirstain (@ykirstain) 's Twitter Profile Photo

Flow and diffusion-based video models are typically trained to denoise pixels. With a similar FLOP budget, they can simultaneously denoise additional video derivatives, such as optical flow, which captures motion more explicitly. This can significantly enhance motion and physics.
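
A hedged sketch of the joint-denoising idea under a shared FLOP budget. The `model` callable is a hypothetical backbone returning noise predictions for both the video and its optical-flow derivative, and the linear noising schedule is illustrative, not the actual training setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def joint_denoising_loss(model, video, flow, t, flow_weight=0.5):
    """Denoise pixels and a precomputed optical-flow derivative together.
    One shared forward pass supervises two targets: a similar FLOP budget,
    but motion becomes an explicit training signal."""
    eps_v = rng.standard_normal(video.shape)
    eps_f = rng.standard_normal(flow.shape)
    a = t.reshape(-1, 1, 1, 1)                 # per-sample noise level in [0, 1]
    noisy_v = (1 - a) * video + a * eps_v
    noisy_f = (1 - a) * flow + a * eps_f
    pred_v, pred_f = model(noisy_v, noisy_f, t)
    return (np.mean((pred_v - eps_v) ** 2)
            + flow_weight * np.mean((pred_f - eps_f) ** 2))
```
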

AK (@_akhaliq) 's Twitter Profile Photo

Meta just dropped VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models. Comparison with OpenAI Sora and Kling.

Lucas Beyer (bl16) (@giffmana) 's Twitter Profile Photo

This is extremely cool! They find diffusion loss is not very sensitive to motion. Thus they fine-tune videogen models with additional explicit motion prediction, making the model generate much more coherent videos. Also, Hila has been doing consistently good work, follow her!

Yuhui Yuan (@rainbowyuhui) 's Twitter Profile Photo

Thrilled to share our latest research on fundamental variable multi-layer transparent image generation, inspired by Schema Theory! ✨ ART enables precise control and scalable layer generation—pioneering a new paradigm for interactive content creation. 🚀 art-msra.github.io

Aviv Bick (@avivbick) 's Twitter Profile Photo

🔥 Llama-level performance with <0.1% of the training data 🔥 Together with Cartesia, we introduce Llamba—a family of recurrent language models distilled from Llama-3 into Mamba. ⚡ Sizes: 1B, 3B, 8B 🚀 Optimized for speed & on-device efficiency Details here 🧵👇

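
Distilling a Transformer teacher into a recurrent student typically means matching next-token distributions rather than pretraining from raw text, which is where the data efficiency comes from. A minimal KL logit-distillation loss sketch (illustrative, not Llamba's actual recipe):

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def distill_kl(teacher_logits, student_logits):
    """KL(teacher || student) over vocabulary logits, averaged over tokens.
    The recurrent student is trained to match the teacher's next-token
    distribution at every position."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return np.sum(np.exp(log_p) * (log_p - log_q), axis=-1).mean()
```
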
Haibin (@eric_haibin_lin) 's Twitter Profile Photo

Qiying Yu and team just dropped the DAPO algorithm (decoupled clip and dynamic sampling policy optimization)! DAPO-Zero-32B, a fully open-source RL reasoning model, surpasses DeepSeek-R1-Zero-Qwen-32B, and scores 50 on AIME 2024 with 50% fewer steps. It is trained with

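
The two named ingredients can be sketched directly. The epsilon values and interfaces below are illustrative assumptions, not the released code's API: the clip bounds are decoupled (the upper bound is raised independently of the lower one, keeping low-probability exploratory tokens trainable), and dynamic sampling drops prompt groups that carry no learning signal.

```python
import numpy as np

def decoupled_clip_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate with asymmetric (decoupled) clip bounds."""
    clipped = np.clip(ratio, 1 - eps_low, 1 + eps_high)
    return np.minimum(ratio * advantage, clipped * advantage)

def dynamic_sampling_keep(group_rewards):
    """Keep a prompt group only if its sampled outputs received different
    rewards; all-identical rewards give zero advantage and no gradient."""
    return len(set(group_rewards)) > 1
```
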
Xiaolong Wang (@xiaolonw) 's Twitter Profile Photo

Test-Time Training (TTT) is now on Video! And not just a 5-second video. We can generate a full 1-min video! TTT module is an RNN module that provides an explicit and efficient memory mechanism. It models the hidden state of an RNN with a machine learning model, which is updated
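
The mechanism can be sketched with a linear inner model: the RNN's hidden state is a weight matrix that takes one gradient step per token on a self-supervised loss, so memory lives in learned weights rather than a fixed-size vector. The corruption and learning rate here are toy choices, not the paper's:

```python
import numpy as np

def ttt_layer(xs, lr=0.1):
    """Test-Time Training layer sketch over a token sequence xs (T, d)."""
    d = xs.shape[1]
    W = np.zeros((d, d))                        # hidden state = model weights
    outputs = []
    for x in xs:
        x_tilde = 0.5 * x                       # toy corrupted view of the token
        err = W @ x_tilde - x                   # grad of 0.5*||W x_tilde - x||^2
        W = W - lr * np.outer(err, x_tilde)     # update the hidden state
        outputs.append(W @ x)                   # emit with the updated weights
    return np.stack(outputs), W
```
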

Aviv Bick (@avivbick) 's Twitter Profile Photo

The Transformer–SSM retrieval gap is driven by just a few heads! SSMs lag on tasks like MMLU (multiple-choice) and GSM8K (math) due to in-context retrieval challenges. But here’s the twist: just a handful of heads handle retrieval in both architectures. What we found 👇 1/

Ricky T. Q. Chen (@rickytqchen) 's Twitter Profile Photo

Padding in our non-AR sequence models? Yuck. 🙅 👉 Instead of unmasking, our new work *Edit Flows* performs iterative refinements via position-relative inserts and deletes, operations naturally suited for variable-length sequence generation. Easily better than using mask tokens.
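
The insert/delete operations are easy to picture with a toy applier. Positions and semantics here are illustrative, not the paper's sampler; the point is that the sequence length changes freely, with no padding or mask tokens involved.

```python
def apply_edits(seq, edits):
    """Apply position-relative edits left-to-right against the current
    (mutating) sequence. edits: ("ins", pos, token) or ("del", pos)."""
    out = list(seq)
    for op in edits:
        if op[0] == "ins":
            _, pos, tok = op
            out.insert(pos, tok)                # sequence grows
        elif op[0] == "del":
            _, pos = op
            del out[pos]                        # sequence shrinks
    return out
```
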

Hila Chefer (@hila_chefer) 's Twitter Profile Photo

Exciting news from #ICML2025 & #ICCV2025 🥳 - 🥇 VideoJAM accepted as *oral* at #ICML2025 (top 1%) - Two talks at #ICCV2025 ☝️interpretability in the generative era ✌️video customization - Organizing two #ICCV2025 workshops ☝️structural priors for vision ✌️long video gen 🧵👇

Neta Shaul (@shaulneta) 's Twitter Profile Photo

[1/n] New paper alert! 🚀 Excited to introduce 𝐓𝐫𝐚𝐧𝐬𝐢𝐭𝐢𝐨𝐧 𝐌𝐚𝐭𝐜𝐡𝐢𝐧𝐠 (𝐓𝐌)! We're replacing short-timestep kernels from Flow Matching/Diffusion with... a generative model🤯, achieving SOTA text-2-image generation! Uriel Singer Itai Gat Yaron Lipman