Zihao Ye (@ye_combinator)'s Twitter Profile
Zihao Ye

@ye_combinator

Building flashinfer (github.com/flashinfer-ai/…)

ID: 916605919210827777

Link: https://homes.cs.washington.edu/~zhye/
Joined: 07-10-2017 10:07:35

118 Tweets

1.1K Followers

511 Following

Sainbayar Sukhbaatar (@tesatory)'s Twitter Profile Photo

Ten years ago in 2015 we published a paper called End-to-End Memory Networks (arxiv.org/abs/1503.08895). Looking back, this paper had many of the ingredients of current LLMs. Our model was the first language model that completely replaced RNN with attention. It had dot-product

Tim Dettmers (@tim_dettmers)'s Twitter Profile Photo

Happy to announce that I joined the CMU Catalyst with three of my incoming students. Our research will bring the best models to consumer GPUs with a focus on agent systems and MoEs. It is amazing to see so many talented people at Catalyst -- a very exciting ecosystem!

Joy Dong (@joychew_d)'s Twitter Profile Photo

Super excited to release FlexAttention for Inference with a decoding backend, GQA, PagedAttention, trainable bias and more! Meet us at the MLSys '25 conference in Santa Clara -- We will present FlexAttention on Wed May 14. #MLSys
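The core idea behind FlexAttention is a user-supplied `score_mod` callable that rewrites each attention logit before the softmax, so masks and biases become ordinary code instead of separate tensors. Below is a dependency-free sketch of that programming model; the names `flex_attention_ref` and `causal` are illustrative, not the PyTorch API, and the real `torch.nn.attention.flex_attention` compiles the `score_mod` into a fused kernel rather than looping in Python.

```python
import math

def flex_attention_ref(q, k, v, score_mod):
    """Pure-Python reference for the score_mod idea: score_mod(score,
    q_idx, kv_idx) rewrites each raw dot-product logit before the softmax."""
    out = []
    for i, qi in enumerate(q):
        scores = [score_mod(sum(a * b for a, b in zip(qi, kj)), i, j)
                  for j, kj in enumerate(k)]
        m = max(scores)                       # max-subtraction for stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out

def causal(score, q_idx, kv_idx):
    # Mask future positions: expressed as a score_mod, not a mask tensor.
    return score if kv_idx <= q_idx else float("-inf")
```

The same signature can express ALiBi-style distance biases or trainable bias terms by returning `score + bias(q_idx, kv_idx)` instead of masking.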

Si-ze Zheng (@deeplyignorant)'s Twitter Profile Photo

🚀 We released Triton-distributed! 🌟 Build compute-communication overlapping kernels for GPUs—performance rivals optimized libraries 🔗 github.com/ByteDance-Seed… 👏 Shoutout to AMD for testing our work! Check their blog: 🔗 …rocm-blogs--981.com.readthedocs.build/projects/inter…

NovaSky (@novaskyai)'s Twitter Profile Photo

1/N Introducing SkyRL-v0, our RL training pipeline enabling efficient RL training for long-horizon, real-environment tasks like SWE-Bench. We also open-source a series of our early trained models to showcase the potential of end-to-end online RL training on long-horizon (20-50

Yixin Dong (@yi_xin_dong)'s Twitter Profile Photo

We are hosting a happy hour with LMSYS Org at #mlsys2025! Join us for engaging talks on SGLang, the structured generation library XGrammar, and the high-performance kernel library FlashInfer. Enjoy great food, lively discussions, and connect with the community! Click to join 👉

Tianqi Chen (@tqchenml)'s Twitter Profile Photo

If you are around in the Bay Area, make sure to attend the #MLSys2025 keynote tomorrow by Soumith Chintala at the Santa Clara Convention Center. Check out the full program at mlsys.org

Vijay (@__tensorcore__)'s Twitter Profile Photo

🚨🔥 CUTLASS 4.0 is released 🔥🚨

pip install nvidia-cutlass-dsl

4.0 marks a major shift for CUTLASS: towards native GPU programming in Python

docs.nvidia.com/cutlass/media/…
Tri Dao (@tri_dao)'s Twitter Profile Photo

I love Cutlass, and this new Python DSL looks very well-designed. Will for sure accelerate kernel dev + exploring new ideas in ML + GPU. I'm already playing with it and having fun

Zihao Ye (@ye_combinator)'s Twitter Profile Photo

We’re thrilled that FlashInfer won a Best Paper Award at MLSys 2025! 🎉 This wouldn’t have been possible without the community — huge thanks to LMSYS Org’s sglang for deep co-design (which is critical for inference kernel evolution) and stress-testing over the years, and to

NVIDIA HPC Developer (@nvidiahpcdev)'s Twitter Profile Photo

🎉 CUTLASS 4.0 is here, bringing native #Python support for device-side kernel design, for ops like GEMM, Flash Attention, and more, powered by the new CuTe DSL. For the first time, you can write high-performance GPU kernels in Python with the same abstractions, APIs, and

Enze Xie (@xieenze_jr)'s Twitter Profile Photo

🚀 Fast-dLLM: 27.6× Faster Diffusion LLMs with KV Cache & Parallel Decoding 💥

Key Features 🌟
- Block-Wise KV Cache
  Reuses 90%+ attention activations via bidirectional caching (prefix/suffix), enabling 8.1×–27.6× throughput gains with <2% accuracy loss 🔄
-
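The block-wise reuse idea above can be sketched in a few lines. This is a toy illustration only, under my own assumptions: the class name `BlockKVCache` is hypothetical, the per-block "state" is a stand-in for the real KV projection, and actual Fast-dLLM caches attention activations bidirectionally rather than by token identity. The point is simply that per-block state is computed once and reused whenever the same block reappears at the same offset.

```python
class BlockKVCache:
    """Toy illustration of block-wise KV reuse (hypothetical class, not
    Fast-dLLM's implementation): per-block state is computed once and
    reused on later passes over the same prefix."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.cache = {}      # (offset, tokens) -> cached per-block state
        self.computed = 0    # number of blocks actually (re)computed

    def kv_for(self, tokens):
        states = []
        for start in range(0, len(tokens), self.block_size):
            block = tuple(tokens[start:start + self.block_size])
            key = (start, block)
            if key not in self.cache:
                self.computed += 1
                # Stand-in for the real KV projection of this block.
                self.cache[key] = [(start + i, t) for i, t in enumerate(block)]
            states.extend(self.cache[key])
        return states
```

On a second pass over an unchanged prefix, `computed` does not grow: only genuinely new blocks pay for a forward computation, which is where the claimed throughput gains come from.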
Han Guo (@hanguo97)'s Twitter Profile Photo

We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between?

Introducing Log-Linear Attention with:

- Log-linear time training
- Log-time inference (in both time and memory)
- Hardware-efficient Triton kernels
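One way to see where the log factor comes from, as a sketch under my own assumptions rather than the paper's actual algorithm or kernels: partition each position's prefix into power-of-two segments, Fenwick-tree style, and keep one summary state per segment. Each query then touches a number of states equal to the popcount of its position t, which is at most O(log t), sitting between linear attention's single state and full attention's t states.

```python
def prefix_buckets(t: int):
    """Fenwick-style decomposition of the prefix [0, t) into power-of-two
    segments. One summary state per segment gives O(log t) states per
    query; the segment count equals the popcount of t."""
    buckets, start = [], 0
    bit = 1 << (t.bit_length() - 1) if t > 0 else 0
    while bit:
        if t & bit:
            buckets.append((start, start + bit))
            start += bit
        bit >>= 1
    return buckets
```

For example, position 13 (binary 1101) decomposes into segments of length 8, 4, and 1, so a query there would read three summaries instead of thirteen past states.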
Vijay (@__tensorcore__)'s Twitter Profile Photo

Another 🔥 blog about CUTLASS from Colfax International, this time focusing on the gory details of block-scaled MXFP and NVFP data types and Blackwell kernels for them. research.colfax-intl.com/cutlass-tutori…

NVIDIA AI Developer (@nvidiaaidev)'s Twitter Profile Photo

🔍 Our deep-dive blog covering our winning MLSys paper on FlashInfer is now live ➡️ nvda.ws/3ZA1Hca

Accelerate LLM inference with FlashInfer—NVIDIA’s high-performance, JIT-compiled library built for ultra-efficient transformer inference on GPUs.

Go under the hood with
zhyncs (@zhyncs42)'s Twitter Profile Photo

SGLang is an early user of FlashInfer and witnessed its rise as the de facto LLM inference kernel library. It won Best Paper at MLSys 2025, and Zihao now leads its development at NVIDIA. SGLang’s GB200 NVL72 optimizations were made possible with strong support from the

Infini-AI-Lab (@infiniailab)'s Twitter Profile Photo

🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation. 🚀 Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46% 🌐 Website: multiverse4fm.github.io 🧵 1/n