Yuval Kirstain (@ykirstain) 's Twitter Profile
Yuval Kirstain

@ykirstain

Research Scientist @Meta | Building GenAI capabilities

ID: 1205923801709588481

Joined: 14-12-2019 18:53:43

401 Tweets

618 Followers

633 Following

Omri Avrahami (@omriavr) 's Twitter Profile Photo

[1/10] 🚨 We present our recent Snap Inc. project: Stable Flow --- A training-free method that performs various types of image editing operations (e.g., non-rigid editing, object addition and replacement) using flow models. Project page: omriavrahami.com/stable-flow

Ziqi Huang (@ziqi_huang_) 's Twitter Profile Photo

🎥 𝗩𝗕𝗲𝗻𝗰𝗵 𝗔𝗿𝗲𝗻𝗮: 𝗪𝗮𝘁𝗰𝗵 𝗔𝗜-𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗲𝗱 𝗩𝗶𝗱𝗲𝗼𝘀 𝗜𝗻𝘀𝘁𝗮𝗻𝘁𝗹𝘆 🎥 ✅ 180,000+ AI-generated videos, 40+ models (and growing) ✅ You can optionally vote for your preferred outputs Try it here: huggingface.co/spaces/Vchitec…

Rohit Girdhar (@_rohitgirdhar_) 's Twitter Profile Photo

Super excited to share some recent work that shows that pure, text-only LLMs can see and hear without any training! Our approach, called "MILS", uses LLMs with off-the-shelf multimodal models to caption images/videos/audio, improve image generation, style transfer, and more!

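
The MILS recipe described above can be sketched as a simple propose-score-refine loop. Everything here is illustrative: `llm_propose` and `score` are hypothetical stand-ins for the text-only LLM and the off-the-shelf multimodal scorer (e.g. a CLIP similarity), not the paper's actual interfaces.

```python
def mils_loop(llm_propose, score, steps=3, k=5):
    """MILS-style training-free loop sketch: an LLM proposes candidate
    captions, a multimodal scorer ranks them, and the top candidates are
    fed back to the LLM as context for the next round."""
    best = []
    for _ in range(steps):
        candidates = llm_propose(best)           # LLM conditions on prior winners
        ranked = sorted(candidates, key=score, reverse=True)
        best = ranked[:k]                        # keep the top-k for feedback
    return best[0]
```

The same loop works for any modality for which a scorer exists, which is why one text-only LLM can "see and hear" without any gradient updates.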
Yinbo Chen (@yinbochen) 's Twitter Profile Photo

Introducing “Diffusion Autoencoders are Scalable Image Tokenizers” (DiTo). We show that with proper designs and scaling up, diffusion autoencoders (a single L2 loss) can outperform the GAN-LPIPS tokenizers (hybrid losses) used in current SOTA generative models. (1/4)

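
A minimal sketch of the single-loss idea, assuming hypothetical `encode` and `denoise` stand-ins (not DiTo's actual architecture): the decoder is a diffusion model conditioned on the encoder's tokens, and the only training signal is one L2 noise-prediction term, in contrast to GAN+LPIPS hybrid objectives.

```python
import numpy as np

rng = np.random.default_rng(0)

def dito_style_loss(encode, denoise, image, t):
    """Diffusion-autoencoder tokenizer loss sketch: one L2 term, no GAN,
    no LPIPS. The noising schedule here is an illustrative linear blend."""
    z = encode(image)                          # latent tokens from the encoder
    eps = rng.standard_normal(image.shape)
    noisy = (1 - t) * image + t * eps          # corrupt the image
    # The decoder predicts the noise, conditioned on the tokens z.
    return np.mean((denoise(noisy, z, t) - eps) ** 2)
```
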
Ishan Misra (@imisra_) 's Twitter Profile Photo

Tokenizers in image/video generation are way understudied! The "standard" recipe: a combinatorial search over different losses using a plethora of models. DiTo is our attempt to break away from this: simpler, scalable, and theoretically sound! Idea: use diffusion to learn the tokens.

Yuval Kirstain (@ykirstain) 's Twitter Profile Photo

Flow and diffusion-based video models are typically trained to denoise pixels. With a similar FLOP budget, they can simultaneously denoise additional video derivatives, such as optical flow, which captures motion more explicitly. This can significantly enhance motion and physics.
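
A hedged sketch of the joint-denoising idea under a shared FLOP budget. The `model` callable is a hypothetical backbone returning noise predictions for both the video and its optical-flow derivative, and the linear noising schedule is illustrative, not the actual training setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def joint_denoising_loss(model, video, flow, t, flow_weight=0.5):
    """Denoise pixels and a precomputed optical-flow derivative together.
    One shared forward pass supervises two targets: a similar FLOP budget,
    but motion becomes an explicit training signal."""
    eps_v = rng.standard_normal(video.shape)
    eps_f = rng.standard_normal(flow.shape)
    a = t.reshape(-1, 1, 1, 1)                 # per-sample noise level in [0, 1]
    noisy_v = (1 - a) * video + a * eps_v
    noisy_f = (1 - a) * flow + a * eps_f
    pred_v, pred_f = model(noisy_v, noisy_f, t)
    return (np.mean((pred_v - eps_v) ** 2)
            + flow_weight * np.mean((pred_f - eps_f) ** 2))
```
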

AK (@_akhaliq) 's Twitter Profile Photo

Meta just dropped VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models. Comparison with OpenAI Sora and Kling.

Lucas Beyer (bl16) (@giffmana) 's Twitter Profile Photo

This is extremely cool! They find diffusion loss is not very sensitive to motion. Thus they fine-tune videogen models with additional explicit motion prediction, making the model generate much more coherent videos. Also, Hila has been doing consistently good work, follow her!

Yuhui Yuan (@rainbowyuhui) 's Twitter Profile Photo

Thrilled to share our latest research on fundamental variable multi-layer transparent image generation, inspired by Schema Theory! ✨ ART enables precise control and scalable layer generation—pioneering a new paradigm for interactive content creation. 🚀 art-msra.github.io

Aviv Bick (@avivbick) 's Twitter Profile Photo

🔥 Llama-level performance with <0.1% of the training data 🔥 Together with Cartesia, we introduce Llamba—a family of recurrent language models distilled from Llama-3 into Mamba. ⚡ Sizes: 1B, 3B, 8B 🚀 Optimized for speed & on-device efficiency Details here 🧵👇

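
Distilling a Transformer teacher into a recurrent student typically means matching next-token distributions rather than pretraining from raw text, which is where the data efficiency comes from. A minimal KL logit-distillation loss sketch (illustrative, not Llamba's actual recipe):

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def distill_kl(teacher_logits, student_logits):
    """KL(teacher || student) over vocabulary logits, averaged over tokens.
    The recurrent student is trained to match the teacher's next-token
    distribution at every position."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return np.sum(np.exp(log_p) * (log_p - log_q), axis=-1).mean()
```
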
Haibin (@eric_haibin_lin) 's Twitter Profile Photo

Qiying Yu and team just dropped the DAPO algorithm (decoupled clip and dynamic sampling policy optimization)! DAPO-Zero-32B, a fully open-source RL reasoning model, surpasses DeepSeek-R1-Zero-Qwen-32B, and scores 50 on AIME 2024 with 50% fewer steps. It is trained with

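
The two named ingredients can be sketched directly. The epsilon values and interfaces below are illustrative assumptions, not the released code's API: the clip bounds are decoupled (the upper bound is raised independently of the lower one, keeping low-probability exploratory tokens trainable), and dynamic sampling drops prompt groups that carry no learning signal.

```python
import numpy as np

def decoupled_clip_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate with asymmetric (decoupled) clip bounds."""
    clipped = np.clip(ratio, 1 - eps_low, 1 + eps_high)
    return np.minimum(ratio * advantage, clipped * advantage)

def dynamic_sampling_keep(group_rewards):
    """Keep a prompt group only if its sampled outputs received different
    rewards; all-identical rewards give zero advantage and no gradient."""
    return len(set(group_rewards)) > 1
```
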
Xiaolong Wang (@xiaolonw) 's Twitter Profile Photo

Test-Time Training (TTT) is now on Video! And not just a 5-second video. We can generate a full 1-min video! TTT module is an RNN module that provides an explicit and efficient memory mechanism. It models the hidden state of an RNN with a machine learning model, which is updated
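
The mechanism can be sketched with a linear inner model: the RNN's hidden state is a weight matrix that takes one gradient step per token on a self-supervised loss, so memory lives in learned weights rather than a fixed-size vector. The corruption and learning rate here are toy choices, not the paper's:

```python
import numpy as np

def ttt_layer(xs, lr=0.1):
    """Test-Time Training layer sketch over a token sequence xs (T, d)."""
    d = xs.shape[1]
    W = np.zeros((d, d))                        # hidden state = model weights
    outputs = []
    for x in xs:
        x_tilde = 0.5 * x                       # toy corrupted view of the token
        err = W @ x_tilde - x                   # grad of 0.5*||W x_tilde - x||^2
        W = W - lr * np.outer(err, x_tilde)     # update the hidden state
        outputs.append(W @ x)                   # emit with the updated weights
    return np.stack(outputs), W
```
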

Aviv Bick (@avivbick) 's Twitter Profile Photo

The Transformer–SSM retrieval gap is driven by just a few heads! SSMs lag on tasks like MMLU (multiple-choice) and GSM8K (math) due to in-context retrieval challenges. But here’s the twist: just a handful of heads handle retrieval in both architectures. What we found 👇 1/

Ricky T. Q. Chen (@rickytqchen) 's Twitter Profile Photo

Padding in our non-AR sequence models? Yuck. 🙅 👉 Instead of unmasking, our new work *Edit Flows* performs iterative refinements via position-relative inserts and deletes, operations naturally suited for variable-length sequence generation. Easily better than using mask tokens.
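
The insert/delete operations are easy to picture with a toy applier. Positions and semantics here are illustrative, not the paper's sampler; the point is that the sequence length changes freely, with no padding or mask tokens involved.

```python
def apply_edits(seq, edits):
    """Apply position-relative edits left-to-right against the current
    (mutating) sequence. edits: ("ins", pos, token) or ("del", pos)."""
    out = list(seq)
    for op in edits:
        if op[0] == "ins":
            _, pos, tok = op
            out.insert(pos, tok)                # sequence grows
        elif op[0] == "del":
            _, pos = op
            del out[pos]                        # sequence shrinks
    return out
```
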

Hila Chefer (@hila_chefer) 's Twitter Profile Photo

Exciting news from #ICML2025 & #ICCV2025 🥳 - 🥇 VideoJAM accepted as *oral* at #ICML2025 (top 1%) - Two talks at #ICCV2025 ☝️interpretability in the generative era ✌️video customization - Organizing two #ICCV2025 workshops ☝️structural priors for vision ✌️long video gen 🧵👇

Neta Shaul (@shaulneta) 's Twitter Profile Photo

[1/n] New paper alert! 🚀 Excited to introduce 𝐓𝐫𝐚𝐧𝐬𝐢𝐭𝐢𝐨𝐧 𝐌𝐚𝐭𝐜𝐡𝐢𝐧𝐠 (𝐓𝐌)! We're replacing short-timestep kernels from Flow Matching/Diffusion with... a generative model🤯, achieving SOTA text-2-image generation! Uriel Singer Itai Gat Yaron Lipman