Dan Fu (@realdanfu) 's Twitter Profile
Dan Fu

@realdanfu

Incoming assistant professor at UCSD CSE in MLSys. Currently recruiting students! Also running the kernels team @togethercompute.

ID: 1173687463790829568

Link: http://danfu.org | Joined: 16-09-2019 19:58:03

710 Tweets

5.5K Followers

205 Following

Austin Silveria (@austinsilveria) 's Twitter Profile Photo

chipmunk is up on arxiv!

across HunyuanVideo and Flux.1-dev, 5-25% of the intermediate activation values in attention and MLPs account for 70-90% of the change in activations across steps

caching + sparsity speeds up generation by recomputing only the fast-changing activations
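
A minimal sketch of the caching + sparsity idea described above, assuming a per-token layer such as an MLP; the function name, the `recompute_fraction` knob, and the top-k selection rule are illustrative only, not the Chipmunk implementation (which relies on custom sparse kernels):

```python
import torch

def sparse_layer_step(layer, x, cache, recompute_fraction=0.1):
    """Recompute only the fastest-changing token activations; reuse the cached
    output from the previous diffusion step for everything else.
    Illustrative sketch only -- not the Chipmunk kernels."""
    if cache is None:                                   # first step: compute everything
        out = layer(x)
        return out, {"x": x, "out": out}

    # Rank tokens by how much their inputs moved since the previous step.
    delta = (x - cache["x"]).abs().sum(dim=-1)          # [batch, tokens]
    k = max(1, int(recompute_fraction * delta.shape[-1]))
    idx = delta.topk(k, dim=-1).indices                 # fastest-changing positions

    out = cache["out"].clone()
    for b in range(x.shape[0]):                         # recompute only the selected tokens
        out[b, idx[b]] = layer(x[b, idx[b]])
    return out, {"x": x, "out": out}
```

In practice the speedup comes from fused sparse attention/MLP kernels rather than Python-level indexing like this, but the control flow is the same: cache the previous step, measure what changed, and recompute only the small fast-changing slice.
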
Infini-AI-Lab (@infiniailab) 's Twitter Profile Photo

🥳 Happy to share our new work –  Kinetics: Rethinking Test-Time Scaling Laws

🤔How to effectively build a powerful reasoning agent?

Existing compute-optimal scaling laws suggest 64K thinking tokens + 1.7B model > 32B model.
But this only shows half of the picture!

🚨 The O(N²)
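
The thread cuts off at the O(N²) term, but the gist can be illustrated with a toy cost model (my own back-of-the-envelope assumption, not the paper's formula): each generated token pays a parameter term of roughly 2 * n_params FLOPs plus an attention term proportional to the tokens generated so far, so long chains of thought pay a quadratic penalty that parameter-only scaling laws ignore. The model shapes below are guesses.

```python
def generation_cost(n_params, n_tokens, d_model, n_layers):
    """Toy FLOP count: linear parameter term + quadratic attention term (illustrative only)."""
    param_flops = 2 * n_params * n_tokens
    # Each new token attends to all previous tokens in every layer (~4 * d_model FLOPs each).
    attn_flops = 4 * n_layers * d_model * n_tokens * (n_tokens - 1) / 2
    return param_flops + attn_flops

# Hypothetical configurations, loosely matching the 1.7B-vs-32B comparison in the tweet.
small = generation_cost(n_params=1.7e9, n_tokens=64_000, d_model=2048, n_layers=28)
large = generation_cost(n_params=32e9,  n_tokens=4_000,  d_model=5120, n_layers=64)
print(f"1.7B model, 64K thinking tokens: {small:.2e} FLOPs")
print(f"32B model,   4K thinking tokens: {large:.2e} FLOPs")
```

Under this toy accounting the quadratic attention term comes to dominate the small model's budget at 64K thinking tokens, which is the kind of effect the tweet is gesturing at; the paper's actual analysis may weigh costs differently (e.g. memory and KV-cache traffic).
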
Sabri Eyuboglu (@eyuboglusabri) 's Twitter Profile Photo

When we put lots of text (e.g. a code repo) into LLM context, cost soars b/c of the KV cache’s size.

What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory by 39x on average.
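
A hedged sketch of the general idea as the tweet describes it: replace the document's full KV cache with a much smaller set of trainable key/value slots, trained offline so the model behaves as if the full document were in context. The class, the `past_kv` argument, and the query-matching training loop are assumptions for illustration, not the paper's self-study recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedCache(nn.Module):
    """A small trainable stand-in for a long document's KV cache (illustrative)."""
    def __init__(self, n_layers, n_heads, n_slots, head_dim):
        super().__init__()
        # 2 = keys and values; n_slots << number of document tokens.
        self.kv = nn.Parameter(0.02 * torch.randn(n_layers, 2, n_heads, n_slots, head_dim))

def train_cache(model, cache, make_batch, steps=1_000, lr=1e-3):
    """Offline loop: pose queries about the document and train the small cache so the
    model's answers match its answers with the full document in context."""
    opt = torch.optim.Adam(cache.parameters(), lr=lr)
    for _ in range(steps):
        queries, teacher_logits = make_batch()              # teacher ran with the full context
        student_logits = model(queries, past_kv=cache.kv)   # hypothetical API
        loss = F.kl_div(student_logits.log_softmax(-1),
                        teacher_logits.softmax(-1), reduction="batchmean")
        opt.zero_grad(); loss.backward(); opt.step()
    return cache
```
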
Hermann (@kumbonghermann) 's Twitter Profile Photo

Excited to be presenting our new work–HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation– at #CVPR2025 this week.

VAR (Visual Autoregressive Modelling) introduced a very nice way to formulate autoregressive image generation as a next-scale prediction task (from
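
For readers unfamiliar with VAR's formulation, here is a rough sketch of next-scale prediction: the image is represented as token maps at increasing resolutions, and each scale is predicted as a whole, conditioned on all coarser scales. `model.predict_scale` is a hypothetical method used only to show the control flow, not an API from the VAR or HMAR codebases.

```python
import torch

def next_scale_generation(model, scales=(1, 2, 4, 8, 16)):
    """Coarse-to-fine generation: predict each s x s token map conditioned on all
    coarser maps (rough sketch of VAR-style next-scale prediction)."""
    generated = []                                       # token maps, coarse to fine
    for s in scales:
        context = (torch.cat([g.flatten(1) for g in generated], dim=1)
                   if generated else None)
        tokens = model.predict_scale(context, size=s)    # hypothetical API: [batch, s*s]
        generated.append(tokens.view(-1, s, s))
    return generated[-1]                                 # finest-scale token map
```
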
Dan Fu (@realdanfu) 's Twitter Profile Photo

Announcing HMAR - Efficient Hierarchical Masked Auto-Regressive Image Generation, led by Hermann! HMAR is hardware-efficient: it reformulates autoregressive image generation in a way that can take advantage of tensor cores. Hermann is presenting it at CVPR this week!

Keshigeyan Chandrasegaran (@keshigeyan) 's Twitter Profile Photo

1/ Model architectures have been mostly treated as fixed post-training. 🌱 Introducing Grafting: A new way to edit pretrained diffusion transformers, allowing us to customize architectural designs on a small compute budget. 🌎 grafting.stanford.edu Co-led with Michael Poli

Dan Fu (@realdanfu) 's Twitter Profile Photo

And to close out a trio of diffusion papers… Super excited to announce Grafting - a method for distilling pretrained diffusion transformers into *new architectures*, led by Keshigeyan Chandrasegaran! Swap attention for new primitives for 2% of pretraining cost; exciting for modeling research!
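
A minimal sketch of what "swap attention for new primitives and distill" could look like for a single block; the regression objective, step count, and `new_primitive` module are assumptions for illustration, not the actual grafting recipe (see grafting.stanford.edu).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def graft_block(teacher_block, new_primitive, activations, steps=2_000, lr=1e-4):
    """Replace one pretrained operator (e.g. softmax attention) with a new primitive
    and train the new module to match the old one's outputs. Illustrative sketch."""
    opt = torch.optim.Adam(new_primitive.parameters(), lr=lr)
    teacher_block.eval()
    for step, x in zip(range(steps), activations):   # x: inputs saved from the pretrained model
        with torch.no_grad():
            target = teacher_block(x)                # what the original operator produced
        loss = F.mse_loss(new_primitive(x), target)
        opt.zero_grad(); loss.backward(); opt.step()
    return new_primitive                             # drop-in replacement for the old block
```

Regressing each new operator onto activations the pretrained model already produces is far cheaper than training from scratch, which is plausibly how a figure like the ~2% of pretraining cost cited in the tweet becomes reachable.
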

Alex Ratner (@ajratner) 's Twitter Profile Photo

Scale alone is not enough for AI data. Quality and complexity are equally critical. Excited to support all of these for LLM developers with Snorkel AI Data-as-a-Service, and to share our new leaderboard! — Our decade-plus of research and work in AI data has a simple point:

soham (@sohamgovande) 's Twitter Profile Photo

Chipmunks can now hop across multiple GPU architectures (sm_80, sm_89, sm_90). You can get a 1.4-3x lossless speedup when generating videos on A100s, 4090s, and H100s!

Chipmunks also play with more open-source models: Mochi, Wan, & others (w/ tutorials for integration) 🐿️
Hermann (@kumbonghermann) 's Twitter Profile Photo

Happy to share that our HMAR code and pre-trained models are now publicly available. Please try them out here:
Code: github.com/NVlabs/HMAR
Checkpoints: huggingface.co/nvidia/HMAR