Michael Goin (@mgoin_) 's Twitter Profile
Michael Goin

@mgoin_

Engineering Lead @neuralmagic @redhat | Committer @vllm_project | Compressing LLMs and making fast software

ID: 1516059175910092803

Link: https://github.com/mgoin · Joined: 18-04-2022 14:21:01

576 Tweets

733 Followers

278 Following

Charles 🎉 Frye (@charles_irl) 's Twitter Profile Photo

Dozens of teams have asked my advice on running LLMs. How fast is DeepSeek V3 with vLLM on 8 GPUs? What's the max throughput of Qwen 2.5 Coder with SGLang on one H100? Running & sharing benchmarks ad hoc was too slow, so we built a tiny app: the LLM Engine Advisor.

Michael Goin (@mgoin_) 's Twitter Profile Photo

A treasure trove of LLM serving benchmarks! This almanac is full of insightful analysis around the results, which I think will help us all benchmark better in the future. I recommend reading the executive summary for some performance prose :)

Red Hat AI (@redhat_ai) 's Twitter Profile Photo

How does llm-d schedule LLM requests across nodes, given your workload mix and vLLM’s architecture? This video covers: ✅ Single-workload scheduling ✅ Disaggregated prefill/decode ✅ High-load strategies ✅ Session-aware routing youtube.com/watch?v=L6vxtg…

vLLM (@vllm_project) 's Twitter Profile Photo

⬆️ uv pip install -U vllm --extra-index-url wheels.vllm.ai/0.9.1rc1 --torch-backend=auto

Try out Magistral with vLLM 0.9.1rc1 today! 🔮
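As a rough illustration (my addition, not part of the tweet), here is one way Magistral could be loaded with vLLM's offline Python API; the checkpoint name and the Mistral-format flags are assumptions based on how other Mistral models are typically loaded, so check the vLLM docs for the exact recipe.

```python
# Hedged sketch: running a Magistral checkpoint with vLLM's offline API.
# Model id and the mistral-format options below are assumptions, not from the tweet.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Magistral-Small-2506",  # assumed checkpoint name
    tokenizer_mode="mistral",                # Mistral models ship their own tokenizer format
    config_format="mistral",
    load_format="mistral",
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.chat(
    [{"role": "user", "content": "Explain KV-cache paging in two sentences."}],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)
```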

Michael Goin (@mgoin_) 's Twitter Profile Photo

Exciting first day talking about vLLM in Singapore! I had a great time discussing in depth with EmbeddedLLM how we will make AMD better across the diverse features and workloads in vLLM. So thankful for our vibrant OSS community across the world 🫶

Eldar Kurtic (@_eldarkurtic) 's Twitter Profile Photo

The recording of Erwan Gallen's and my PyTorch Day France 2025 and GOSIM Foundation talk, "Scaling LLM Inference with vLLM," is now available on PyTorch’s YouTube channel. youtube.com/watch?v=XYh6Xf…

PyTorch (@pytorch) 's Twitter Profile Photo

PyTorch and vLLM are both critical to the AI ecosystem and are increasingly being used together for cutting-edge generative AI applications, including inference, post-training, and agentic systems at scale.

🔗 Learn more about PyTorch → vLLM integrations and what’s to come:
Charles 🎉 Frye (@charles_irl) 's Twitter Profile Photo


someone asked me about how to prepare to work for a systems engineering company

this is the advice i gave

honestly, i think it's useful for almost all software engineers -- effort spent learning things that change slowly is better amortized. and important things change slowly!
SemiAnalysis (@semianalysis_) 's Twitter Profile Photo

Happy 4th of July! Speed is the moat, and Anush Elangovan & his team keep running faster & faster. Still lots of areas where ROCm has gaps, but many are already closing.

vLLM (@vllm_project) 's Twitter Profile Photo

We genuinely want to solve this problem. As many (Tan TJian, samsja, Daniel Han, Eldar Kurtić, and more!) chimed in, the reasons include attention kernels, matmul reduction order, precision in various operators, and more!
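A small illustration of one of those reasons (my addition, not from the thread): changing only the reduction order of a float32 matmul is usually enough to change the result slightly, which is why different kernels or tile sizes produce different outputs.

```python
# Sketch: same matmul, two reduction orders, float32 results typically differ.
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 4096)).astype(np.float32)
b = rng.standard_normal((4096, 64)).astype(np.float32)

full = a @ b                                               # reduce over all 4096 terms at once
split = a[:, :2048] @ b[:2048] + a[:, 2048:] @ b[2048:]    # same math, different accumulation order

print(np.abs(full - split).max())  # typically a small but nonzero difference
```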

EmbeddedLLM (@embeddedllm) 's Twitter Profile Photo


Pro-tip for vLLM power-users: free ≈ 90 % of your GPU VRAM in seconds—no restarts required🚀

🚩 Why you’ll want this
• Hot-swap new checkpoints on the same card
• Rotate multiple LLMs on one GPU (batch jobs, micro-services, A/B tests)
• Stage-based pipelines that call
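A rough sketch of the kind of flow the tip describes (my addition), using vLLM's sleep-mode API (enable_sleep_mode, sleep, wake_up). The model name is just an example, and exact levels and behavior may differ by vLLM version.

```python
# Hedged sketch: free most GPU VRAM between stages without restarting the process.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", enable_sleep_mode=True)  # example model

print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)

# level=1 offloads weights to CPU RAM and drops the KV cache;
# level=2 also discards the weights entirely (they must be reloaded before reuse).
llm.sleep(level=1)

# ... run another job / stage on the same GPU here ...

llm.wake_up()  # restore the engine and keep serving
print(llm.generate(["Hi again"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```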
vLLM (@vllm_project) 's Twitter Profile Photo

Kimi.ai just released a trillion-parameter model with great agentic capability, and it is already supported in vLLM! Give it a try with a simple command, and check the docs for more advanced deployment 🚀

<a href="/Kimi_Moonshot/">Kimi.ai</a> just released a trillion-parameter model with great agentic capability, and it is already supported in vLLM! Have a try with a simple command, and check the doc for more advanced deployment🚀
Dan Alistarh (@dalistarh) 's Twitter Profile Photo


Announcing our early work on FP4 inference for LLMs! 
- QuTLASS: low-precision kernel support for Blackwell GPUs
- FP-Quant: a flexible quantization harness for Llama/Qwen 
We reach 4x speedup vs BF16, with good accuracy through MXFP4 microscaling + fused Hadamard rotations.
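To make the "MXFP4 microscaling" idea concrete (my addition, not the QuTLASS or FP-Quant code): each block of 32 values shares a single power-of-two scale, and each element is rounded to the nearest FP4 (E2M1) value. A minimal fake-quantization sketch, assuming the standard OCP MX block size of 32:

```python
# Hedged sketch: fake-quantize one 32-element block to MXFP4 (shared power-of-two scale + E2M1 elements).
import numpy as np

FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-FP4_E2M1[::-1], FP4_E2M1])  # all signed representable values

def mxfp4_fake_quant_block(x):
    """Quantize-dequantize a block of 32 values with a shared E8M0-style scale."""
    amax = np.abs(x).max()
    if amax == 0:
        return np.zeros_like(x)
    # pick a power-of-two scale so the largest element lands near the FP4 max (6.0)
    shared_exp = np.floor(np.log2(amax)) - 2.0  # 2 = floor(log2(6))
    scale = 2.0 ** shared_exp
    scaled = x / scale
    # round each scaled element to the nearest representable E2M1 value (saturating at ±6)
    idx = np.abs(scaled[:, None] - GRID[None, :]).argmin(axis=1)
    return GRID[idx] * scale

rng = np.random.default_rng(0)
block = rng.standard_normal(32).astype(np.float32)
print(np.abs(block - mxfp4_fake_quant_block(block)).mean())  # per-block quantization error
```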
Cohere Labs (@cohere_labs) 's Twitter Profile Photo

Tune in tomorrow, July 16th, for a session with Kimbo on "Dragon in the CUDA Moat: NVIDIA Tensor Core Evolution." Learn more: cohere.com/events/Cohere-…