Zihao Ye (@ye_combinator)'s Twitter Profile
Zihao Ye

@ye_combinator

Building flashinfer (github.com/flashinfer-ai/…)

ID: 916605919210827777

Link: https://homes.cs.washington.edu/~zhye/
Joined: 07-10-2017 10:07:35

118 Tweets

1.1K Followers

511 Following

Sainbayar Sukhbaatar (@tesatory)'s Twitter Profile Photo

Ten years ago in 2015 we published a paper called End-to-End Memory Networks (arxiv.org/abs/1503.08895). Looking back, this paper had many of the ingredients of current LLMs. Our model was the first language model that completely replaced RNN with attention. It had dot-product

Tim Dettmers (@tim_dettmers)'s Twitter Profile Photo

Happy to announce that I joined the CMU Catalyst with three of my incoming students. Our research will bring the best models to consumer GPUs with a focus on agent systems and MoEs. It is amazing to see so many talented people at Catalyst -- a very exciting ecosystem!

Joy Dong (@joychew_d)'s Twitter Profile Photo

Super excited to release FlexAttention for Inference with a decoding backend, GQA, PagedAttention, trainable bias and more! Meet us at the MLSys '25 conference in Santa Clara -- We will present FlexAttention on Wed May 14. #MLSys
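The core idea behind FlexAttention is a user-supplied `score_mod` callable that rewrites each attention logit before the softmax, so masks and biases become ordinary code instead of separate tensors. Below is a dependency-free sketch of that programming model; the names `flex_attention_ref` and `causal` are illustrative, not the PyTorch API, and the real `torch.nn.attention.flex_attention` compiles the `score_mod` into a fused kernel rather than looping in Python.

```python
import math

def flex_attention_ref(q, k, v, score_mod):
    """Pure-Python reference for the score_mod idea: score_mod(score,
    q_idx, kv_idx) rewrites each raw dot-product logit before the softmax."""
    out = []
    for i, qi in enumerate(q):
        scores = [score_mod(sum(a * b for a, b in zip(qi, kj)), i, j)
                  for j, kj in enumerate(k)]
        m = max(scores)                       # max-subtraction for stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out

def causal(score, q_idx, kv_idx):
    # Mask future positions: expressed as a score_mod, not a mask tensor.
    return score if kv_idx <= q_idx else float("-inf")
```

The same signature can express ALiBi-style distance biases or trainable bias terms by returning `score + bias(q_idx, kv_idx)` instead of masking.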

Si-ze Zheng (@deeplyignorant)'s Twitter Profile Photo

🚀 We released Triton-distributed! 🌟 Build compute-communication overlapping kernels for GPUs—performance rivals optimized libraries 🔗 github.com/ByteDance-Seed… 👏 Shoutout to AMD for testing our work! Check their blog: 🔗 …rocm-blogs--981.com.readthedocs.build/projects/inter…

NovaSky (@novaskyai)'s Twitter Profile Photo

1/N Introducing SkyRL-v0, our RL training pipeline enabling efficient RL training for long-horizon, real-environment tasks like SWE-Bench. We also open-source a series of our early trained models to showcase the potential of end-to-end online RL training on long-horizon (20-50

Yixin Dong (@yi_xin_dong)'s Twitter Profile Photo

We are hosting a happy hour with LMSYS Org at #mlsys2025! Join us for engaging talks on SGLang, the structured generation library XGrammar, and the high-performance kernel library FlashInfer. Enjoy great food, lively discussions, and connect with the community! Click to join 👉

Tianqi Chen (@tqchenml)'s Twitter Profile Photo

If you are around in the Bay Area, make sure to attend the #MLSys2025 keynote tomorrow by Soumith Chintala at the Santa Clara Convention Center. Check out the full program at mlsys.org

Vijay (@__tensorcore__)'s Twitter Profile Photo

🚨🔥 CUTLASS 4.0 is released 🔥🚨

pip install nvidia-cutlass-dsl

4.0 marks a major shift for CUTLASS: towards native GPU programming in Python

docs.nvidia.com/cutlass/media/…
Tri Dao (@tri_dao)'s Twitter Profile Photo

I love Cutlass, and this new Python DSL looks very well-designed. Will for sure accelerate kernel dev + exploring new ideas in ML + GPU. I'm already playing with it and having fun

Zihao Ye (@ye_combinator)'s Twitter Profile Photo

We’re thrilled that FlashInfer won a Best Paper Award at MLSys 2025! 🎉 This wouldn’t have been possible without the community — huge thanks to LMSYS Org’s sglang for deep co-design (which is critical for inference kernel evolution) and stress-testing over the years, and to

NVIDIA HPC Developer (@nvidiahpcdev)'s Twitter Profile Photo

🎉 CUTLASS 4.0 is here, bringing native #Python support for device-side kernel design, for ops like GEMM, Flash Attention, and more, powered by the new CuTe DSL. For the first time, you can write high-performance GPU kernels in Python with the same abstractions, APIs, and

Enze Xie (@xieenze_jr)'s Twitter Profile Photo

🚀 Fast-dLLM: 27.6× Faster Diffusion LLMs with KV Cache & Parallel Decoding 💥

Key Features 🌟
- Block-Wise KV Cache
  Reuses 90%+ attention activations via bidirectional caching (prefix/suffix), enabling 8.1×–27.6× throughput gains with <2% accuracy loss 🔄
-
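The block-wise reuse idea above can be sketched in a few lines. This is a toy illustration only, under my own assumptions: the class name `BlockKVCache` is hypothetical, the per-block "state" is a stand-in for the real KV projection, and actual Fast-dLLM caches attention activations bidirectionally rather than by token identity. The point is simply that per-block state is computed once and reused whenever the same block reappears at the same offset.

```python
class BlockKVCache:
    """Toy illustration of block-wise KV reuse (hypothetical class, not
    Fast-dLLM's implementation): per-block state is computed once and
    reused on later passes over the same prefix."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.cache = {}      # (offset, tokens) -> cached per-block state
        self.computed = 0    # number of blocks actually (re)computed

    def kv_for(self, tokens):
        states = []
        for start in range(0, len(tokens), self.block_size):
            block = tuple(tokens[start:start + self.block_size])
            key = (start, block)
            if key not in self.cache:
                self.computed += 1
                # Stand-in for the real KV projection of this block.
                self.cache[key] = [(start + i, t) for i, t in enumerate(block)]
            states.extend(self.cache[key])
        return states
```

On a second pass over an unchanged prefix, `computed` does not grow: only genuinely new blocks pay for a forward computation, which is where the claimed throughput gains come from.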
Han Guo (@hanguo97)'s Twitter Profile Photo

We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between?

Introducing Log-Linear Attention with:

- Log-linear time training
- Log-time inference (in both time and memory)
- Hardware-efficient Triton kernels
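One way to see where the log factor comes from, as a sketch under my own assumptions rather than the paper's actual algorithm or kernels: partition each position's prefix into power-of-two segments, Fenwick-tree style, and keep one summary state per segment. Each query then touches a number of states equal to the popcount of its position t, which is at most O(log t), sitting between linear attention's single state and full attention's t states.

```python
def prefix_buckets(t: int):
    """Fenwick-style decomposition of the prefix [0, t) into power-of-two
    segments. One summary state per segment gives O(log t) states per
    query; the segment count equals the popcount of t."""
    buckets, start = [], 0
    bit = 1 << (t.bit_length() - 1) if t > 0 else 0
    while bit:
        if t & bit:
            buckets.append((start, start + bit))
            start += bit
        bit >>= 1
    return buckets
```

For example, position 13 (binary 1101) decomposes into segments of length 8, 4, and 1, so a query there would read three summaries instead of thirteen past states.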
Vijay (@__tensorcore__)'s Twitter Profile Photo

Another 🔥 blog about CUTLASS from Colfax International, this time focusing on the gory details of block-scaled MXFP and NVFP data types and Blackwell kernels for them. research.colfax-intl.com/cutlass-tutori…

NVIDIA AI Developer (@nvidiaaidev)'s Twitter Profile Photo

🔍 Our deep-dive blog covering our winning MLSys paper on FlashInfer is now live ➡️ nvda.ws/3ZA1Hca

Accelerate LLM inference with FlashInfer—NVIDIA’s high-performance, JIT-compiled library built for ultra-efficient transformer inference on GPUs.

Go under the hood with
zhyncs (@zhyncs42)'s Twitter Profile Photo

SGLang is an early user of FlashInfer and witnessed its rise as the de facto LLM inference kernel library. It won Best Paper at MLSys 2025, and Zihao now leads its development at NVIDIA. SGLang’s GB200 NVL72 optimizations were made possible with strong support from the

Infini-AI-Lab (@infiniailab)'s Twitter Profile Photo

🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation. 🚀 Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46% 🌐 Website: multiverse4fm.github.io 🧵 1/n