Michael Goin (@mgoin_) 's Twitter Profile
Michael Goin

@mgoin_

Engineering Lead @neuralmagic @redhat | Committer @vllm_project | Compressing LLMs and making fast software

ID: 1516059175910092803

Link: https://github.com/mgoin · Joined: 18-04-2022 14:21:01

576 Tweets

733 Followers

278 Following

Charles 🎉 Frye (@charles_irl) 's Twitter Profile Photo

Dozens of teams have asked my advice on running LLMs. How fast is DeepSeek V3 with vLLM on 8 GPUs? What's the max throughput of Qwen 2.5 Coder with SGLang on one H100? Running & sharing benchmarks ad hoc was too slow, so we built a tiny app: the LLM Engine Advisor.

Michael Goin (@mgoin_) 's Twitter Profile Photo

A treasure trove of LLM serving benchmarks! This almanac is full of insightful analysis around the results, which I think will help us all benchmark better in the future. I recommend reading the executive summary for some performance prose :)

Red Hat AI (@redhat_ai) 's Twitter Profile Photo

How does llm-d schedule LLM requests across nodes, given your workload mix and vLLM’s architecture? This video covers: ✅ Single-workload scheduling ✅ Disaggregated prefill/decode ✅ High-load strategies ✅ Session-aware routing youtube.com/watch?v=L6vxtg…

vLLM (@vllm_project) 's Twitter Profile Photo

⬆️ uv pip install -U vllm --extra-index-url wheels.vllm.ai/0.9.1rc1 --torch-backend=auto

Try out Magistral with vLLM 0.9.1rc1 today! 🔮
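As a rough illustration (my addition, not part of the tweet), here is one way Magistral could be loaded with vLLM's offline Python API; the checkpoint name and the Mistral-format flags are assumptions based on how other Mistral models are typically loaded, so check the vLLM docs for the exact recipe.

```python
# Hedged sketch: running a Magistral checkpoint with vLLM's offline API.
# Model id and the mistral-format options below are assumptions, not from the tweet.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Magistral-Small-2506",  # assumed checkpoint name
    tokenizer_mode="mistral",                # Mistral models ship their own tokenizer format
    config_format="mistral",
    load_format="mistral",
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.chat(
    [{"role": "user", "content": "Explain KV-cache paging in two sentences."}],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)
```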

Michael Goin (@mgoin_) 's Twitter Profile Photo

Exciting first day talking about vLLM in Singapore! I had a great time discussing in depth with EmbeddedLLM how we will make AMD better across the diverse features and workloads in vLLM. So thankful for our vibrant OSS community across the world 🫶

Eldar Kurtic (@_eldarkurtic) 's Twitter Profile Photo

The recording of Erwan Gallen's and my PyTorch Day France 2025 and GOSIM Foundation talk, "Scaling LLM Inference with vLLM," is now available on PyTorch’s YouTube channel. youtube.com/watch?v=XYh6Xf…

PyTorch (@pytorch) 's Twitter Profile Photo

PyTorch and vLLM are both critical to the AI ecosystem and are increasingly being used together for cutting-edge generative AI applications, including inference, post-training, and agentic systems at scale.

🔗 Learn more about PyTorch → vLLM integrations and what’s to come:
Charles 🎉 Frye (@charles_irl) 's Twitter Profile Photo


someone asked me about how to prepare to work for a systems engineering company

this is the advice i gave

honestly, i think it's useful for almost all software engineers -- effort spent learning things that change slowly is better amortized. and important things change slowly!
SemiAnalysis (@semianalysis_) 's Twitter Profile Photo

Happy 4th of July! Speed is the moat, and Anush Elangovan & his team keep running faster & faster. Still lots of areas where ROCm has gaps, but many are already closing.

vLLM (@vllm_project) 's Twitter Profile Photo

We genuinely want to solve this problem. As many (Tan TJian, samsja, Daniel Han, Eldar Kurtić, and more!) chimed in, the reasons include attention kernels, matmul reduction order, precision in various operators, and more!
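A small illustration of one of those reasons (my addition, not from the thread): changing only the reduction order of a float32 matmul is usually enough to change the result slightly, which is why different kernels or tile sizes produce different outputs.

```python
# Sketch: same matmul, two reduction orders, float32 results typically differ.
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 4096)).astype(np.float32)
b = rng.standard_normal((4096, 64)).astype(np.float32)

full = a @ b                                               # reduce over all 4096 terms at once
split = a[:, :2048] @ b[:2048] + a[:, 2048:] @ b[2048:]    # same math, different accumulation order

print(np.abs(full - split).max())  # typically a small but nonzero difference
```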

EmbeddedLLM (@embeddedllm) 's Twitter Profile Photo


Pro-tip for vLLM power-users: free ≈ 90 % of your GPU VRAM in seconds—no restarts required🚀

🚩 Why you’ll want this
• Hot-swap new checkpoints on the same card
• Rotate multiple LLMs on one GPU (batch jobs, micro-services, A/B tests)
• Stage-based pipelines that call
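A rough sketch of the kind of flow the tip describes (my addition), using vLLM's sleep-mode API (enable_sleep_mode, sleep, wake_up). The model name is just an example, and exact levels and behavior may differ by vLLM version.

```python
# Hedged sketch: free most GPU VRAM between stages without restarting the process.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", enable_sleep_mode=True)  # example model

print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)

# level=1 offloads weights to CPU RAM and drops the KV cache;
# level=2 also discards the weights entirely (they must be reloaded before reuse).
llm.sleep(level=1)

# ... run another job / stage on the same GPU here ...

llm.wake_up()  # restore the engine and keep serving
print(llm.generate(["Hi again"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```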
vLLM (@vllm_project) 's Twitter Profile Photo

Kimi.ai just released a trillion-parameter model with great agentic capability, and it is already supported in vLLM! Give it a try with a simple command, and check the docs for more advanced deployment 🚀

<a href="/Kimi_Moonshot/">Kimi.ai</a> just released a trillion-parameter model with great agentic capability, and it is already supported in vLLM! Have a try with a simple command, and check the doc for more advanced deployment🚀
Dan Alistarh (@dalistarh) 's Twitter Profile Photo


Announcing our early work on FP4 inference for LLMs! 
- QuTLASS: low-precision kernel support for Blackwell GPUs
- FP-Quant: a flexible quantization harness for Llama/Qwen 
We reach 4x speedup vs BF16, with good accuracy through MXFP4 microscaling + fused Hadamard rotations.
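To make the "MXFP4 microscaling" idea concrete (my addition, not the QuTLASS or FP-Quant code): each block of 32 values shares a single power-of-two scale, and each element is rounded to the nearest FP4 (E2M1) value. A minimal fake-quantization sketch, assuming the standard OCP MX block size of 32:

```python
# Hedged sketch: fake-quantize one 32-element block to MXFP4 (shared power-of-two scale + E2M1 elements).
import numpy as np

FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-FP4_E2M1[::-1], FP4_E2M1])  # all signed representable values

def mxfp4_fake_quant_block(x):
    """Quantize-dequantize a block of 32 values with a shared E8M0-style scale."""
    amax = np.abs(x).max()
    if amax == 0:
        return np.zeros_like(x)
    # pick a power-of-two scale so the largest element lands near the FP4 max (6.0)
    shared_exp = np.floor(np.log2(amax)) - 2.0  # 2 = floor(log2(6))
    scale = 2.0 ** shared_exp
    scaled = x / scale
    # round each scaled element to the nearest representable E2M1 value (saturating at ±6)
    idx = np.abs(scaled[:, None] - GRID[None, :]).argmin(axis=1)
    return GRID[idx] * scale

rng = np.random.default_rng(0)
block = rng.standard_normal(32).astype(np.float32)
print(np.abs(block - mxfp4_fake_quant_block(block)).mean())  # per-block quantization error
```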
Cohere Labs (@cohere_labs) 's Twitter Profile Photo

Tune in tomorrow, July 16th, for a session with Kimbo on "Dragon in the CUDA Moat: NVIDIA Tensor Core Evolution." Learn more: cohere.com/events/Cohere-…