brian stevens (@addvin)'s Twitter Profile
brian stevens

@addvin

CEO, Neural Magic. Ex-VP & CTO of Google Cloud and ex-EVP & CTO of Red Hat. RPI and UNH alum, marathoner, Ironman, ADK MT 46er.

ID: 18744691

Joined: 07-01-2009 23:43:02

738 Tweets

4.4K Followers

159 Following

vLLM (@vllm_project)'s Twitter Profile Photo

⚔ The Llama 3.1 series is uniquely challenging due to its long context and large size. We want to thank Red Hat AI (formerly Neural Magic) for their continual stewardship of the quantization code path in vLLM, Anyscale for their high-quality implementation of chunked prefill and speculative decoding,
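
A minimal sketch of what the long-context serving path looks like from the user side, assuming a recent vLLM release where chunked prefill is exposed as the `enable_chunked_prefill` engine argument (the model id and lengths below are illustrative):

```python
from vllm import LLM, SamplingParams

# Chunked prefill splits a long prompt's prefill into smaller chunks that can
# be batched together with ongoing decode steps, instead of stalling them.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative checkpoint
    max_model_len=32768,          # Llama 3.1 supports context up to 128k
    enable_chunked_prefill=True,
)

params = SamplingParams(temperature=0.0, max_tokens=64)
out = llm.generate(["Summarize chunked prefill in one sentence."], params)
print(out[0].outputs[0].text)
```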

Mark Kurtz (@markurtz_)'s Twitter Profile Photo

🧵1/4 Our Llama 3.1 compression project is underway, aiming for cost-effective and sustainable deployments without compromising accuracy. The FP8 quantized Llama 3.1 8B model has already achieved over 99% recovery! 🎯📉 #LLMs #vLLM #AI #MachineLearning #Quantization
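
For readers wondering what "FP8 quantized" means mechanically: each weight tensor is rescaled into the 8-bit E4M3 floating-point range and cast down. A minimal per-tensor sketch in plain PyTorch (requires float8 support, PyTorch >= 2.1; the released models were produced with Neural Magic's tooling, so this only illustrates the arithmetic):

```python
import torch

def quantize_fp8(w: torch.Tensor):
    """Per-tensor FP8 (E4M3) quantization: scale max |w| onto the format's range."""
    finfo = torch.finfo(torch.float8_e4m3fn)
    scale = w.abs().max().clamp(min=1e-12) / finfo.max
    q = (w / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return q, scale

def dequantize_fp8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)
q, s = quantize_fp8(w)
print((w - dequantize_fp8(q, s)).abs().mean())  # mean quantization error is small
```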

Roshan Sumbaly (@rsumbaly)'s Twitter Profile Photo

Great to see the community moving fast to adapt Llama 3.1 to their needs. This is the beauty of open source and a key part of why we're going to share more of our system-level thinking with Llama Stack. Great work vLLM and @neuralmagic folks - let's find more ways to work

Red Hat AI (@redhat_ai)'s Twitter Profile Photo

Our team has been busy releasing quantized Llama 3.1 models, thoroughly evaluated to ensure optimal performance in #vLLM. Check them out and let us know your thoughts and feedback by commenting on this post. 🙏 AI at Meta vLLM huggingface.co/collections/ne…
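
Pre-quantized checkpoints like these load in vLLM like any other model; the quantization scheme is picked up from the checkpoint's config. A quick sketch (the collection link above is truncated, so the repo id below is an assumption):

```python
from vllm import LLM, SamplingParams

# The FP8 scheme is read from the checkpoint's config; no extra flags needed.
llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8")  # assumed repo id
out = llm.generate(["What is vLLM?"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```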

Red Hat AI (@redhat_ai)'s Twitter Profile Photo

vLLM is the leading open-source inference server with 24k GitHub stars. Join us for bi-weekly vLLM Office Hours to learn about the project, get involved, and provide feedback. Here's what to expect this week: 1. Get the latest updates from Neural Magic’s Engineering Lead and top

Red Hat AI (@redhat_ai)'s Twitter Profile Photo

Sparse-Marlin is here and integrated into vLLM! This GPU-optimized kernel accelerates matrix multiplication with 4-bit quantized weights and 2:4 sparsity, achieving 5.3x speedups on NVIDIA GPUs (Ampere/Ada). Maintains efficiency with batch sizes up to 32. Links below.

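For reference, 2:4 ("semi-structured") sparsity keeps at most two non-zero weights in every contiguous group of four, which is what lets hardware skip half the multiplies. A small PyTorch sketch of imposing the pattern by magnitude (illustration only; Sparse-Marlin itself is a fused CUDA kernel operating on packed 4-bit weights):

```python
import torch

def prune_2_4(w: torch.Tensor) -> torch.Tensor:
    """Zero the two smallest-magnitude weights in every contiguous group of four."""
    groups = w.reshape(-1, 4)
    keep = groups.abs().topk(2, dim=1).indices            # 2 largest |w| per group
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(1, keep, True)
    return (groups * mask).reshape(w.shape)

w = torch.randn(8, 8)
print(prune_2_4(w))  # exactly two non-zeros in every group of four
```
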
brian stevens (@addvin)'s Twitter Profile Photo

Quantization of LLMs is critical for efficient deployments. But how do you avoid any negative impact of quantization on model capability? Our latest research across Llama variants will serve as a great guide.
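
The yardstick in that research is accuracy recovery: the quantized model's benchmark score as a percentage of the full-precision baseline's. A worked example with illustrative numbers:

```python
def recovery(quantized: float, baseline: float) -> float:
    """Quantized score as a percentage of the full-precision baseline."""
    return 100.0 * quantized / baseline

# e.g. a baseline scoring 69.0 and an FP8 model scoring 68.5 on the same
# benchmark gives ~99.3% recovery -- the "over 99%" cited above
print(f"{recovery(68.5, 69.0):.1f}%")  # 99.3%
```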

brian stevens (@addvin)'s Twitter Profile Photo

I’m thrilled to announce that Neural Magic has signed a definitive agreement to join forces with Red Hat, Inc.

At Neural Magic, our vision is that the future of AI is open, and we have been on a mission to enable enterprises to capture the powerful innovation from AI, while at
Scale ML (@scaleml)'s Twitter Profile Photo

For our last seminar of the year we will end with Lucas Wilkinson from @neuralmagic presenting! 

Machete: a cutting-edge mixed-input GEMM GPU kernel targeting NVIDIA Hopper GPUs

Time: Dec 4, 3pm EST
Sign up via scale-ml.org to join our mailing list for the Zoom link.
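
"Mixed-input" means the two GEMM operands have different types, e.g. 16-bit activations against 4-bit quantized weights, with the upconversion fused into the kernel. A plain-PyTorch sketch of just the math (int8 values in [-8, 7] stand in for packed 4-bit weights; the real kernel dequantizes in-register inside the Hopper tensor-core pipeline):

```python
import torch

def mixed_input_gemm(a: torch.Tensor, w_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """a: [M, K] float activations; w_q: [K, N] int8 holding 4-bit values in
    [-8, 7]; scale: [N] per-output-channel scales."""
    w = w_q.to(a.dtype) * scale  # dequantize; fused into the GEMM in a real kernel
    return a @ w

a = torch.randn(2, 64)                                   # fp16 on GPU in practice
w_q = torch.randint(-8, 8, (64, 32), dtype=torch.int8)
scale = torch.full((32,), 0.05)
print(mixed_input_gemm(a, w_q, scale).shape)             # torch.Size([2, 32])
```
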
Red Hat AI (@redhat_ai)'s Twitter Profile Photo

If you are at #NeurIPS2024 this week, stop by the Neural Magic booth #307 and talk to us about vLLM! vLLM core committer Michael Goin will be there, ready to hear your ideas and share them with the team. The best feature requests always come from in-person chats!

Matt Hicks (@matthicksj)'s Twitter Profile Photo

At Red Hat, we believe the future of AI is open. That's why I'm incredibly excited about our acquisition of Neural Magic (now Red Hat AI). Together, we're furthering our commitment to our customers and the open source community to deliver on the future of AI—and that starts today.

Red Hat AI (@redhat_ai)'s Twitter Profile Photo

DeepSeek’s Open Source Week drops A LOT of exciting goodies! We’re hosting vLLM Office Hours tomorrow—learn what they are, how they integrate with vLLM, & ask questions!

Date: Thursday, Feb 27
Time: 2PM ET / 11AM PT

Register: neuralmagic.com/community-offi… #DeepSeek #AI
NVIDIA AI Developer (@nvidiaaidev)'s Twitter Profile Photo

The llm-d project is a major step forward for the #opensource AI ecosystem, and we are proud to be one of the founding contributors, reflecting our commitment to collaboration as a catalyst for innovation in generative AI.

As generative and agentic AI continue to evolve,
Mark Collier ęŸÆē†ę€€ (@sparkycollier) 's Twitter Profile Photo

Really excited to see the emergence of llm-d, brian stevens! Inference is the biggest workload in human history, and the open source tools need to keep evolving to serve it.

Red Hat AI (@redhat_ai)'s Twitter Profile Photo

Thanks to the LMCache Lab team for joining forces with Red Hat on llm-d! llm-d is a new open source project for scalable, efficient distributed LLM inference with vLLM. Learn more about our collaboration here: blog.lmcache.ai/2025-05-22-red…