Mobius Labs (@mobius_labs)'s Twitter Profile
Mobius Labs

@mobius_labs

Multimodal AI at the world's scale.
Proponents of Open Source and Open Intelligence.
mobiusml.github.io/blog/ for some of our recent work.

ID: 982264845591494660

Website: http://www.mobiuslabs.com · Joined: 06-04-2018 14:32:43

385 Tweets

3.3K Followers

200 Following

mobicham (@mobicham)'s Twitter Profile Photo

So I ran an evaluation of Gemma 3 12B QAT vs. HQQ. 

HQQ takes a few seconds to quantize the model and outperforms the QAT version (AWQ format) while using a higher group-size. 

With GemLite bfloat16 support, you can run quantized Gemma 3 faster without performance issues 🫡
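
For reference, a minimal sketch of quantizing a model with HQQ through the transformers integration; the model id and the 4-bit / group-size-64 settings are placeholders, not necessarily the configuration used in the comparison above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

# Hypothetical settings: 4-bit HQQ with group_size=64 (the tweet compares against a
# larger group-size than the QAT/AWQ baseline, but the exact value is not given here).
model_id = "google/gemma-3-12b-it"
quant_config = HqqConfig(nbits=4, group_size=64)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    quantization_config=quant_config,  # HQQ quantizes on the fly, in seconds
)
```
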
mobicham (@mobicham)'s Twitter Profile Photo

Well-optimized Triton kernels can perform very well end-2-end, even competing with highly optimized kernels like Marlin.
mobicham (@mobicham)'s Twitter Profile Photo

GemLite 0.4.7 is out 🔥. It boosts performance by 5-10 tokens/sec end-2-end by using an interesting trick which might seem weird at first: sample the outputs from pre-allocated zeros. Explanation below 🧵 github.com/mobiusml/gemli…
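
A rough illustration of why this helps (this is not GemLite's actual implementation, just a hypothetical sketch): atomic-add / split-K style kernels need a zero-initialized output, and calling torch.zeros on every decode step pays for an extra memset kernel launch, whereas handing out views of a buffer that is already zero does not.

```python
import torch

N = 8192
# Pre-allocated zeros, created once at init time (hypothetical pool size).
pool = torch.zeros(64 * N, device="cuda")

def out_from_pool(step: int) -> torch.Tensor:
    # Carve a zero-initialized output view out of the pool; the caller would have to
    # re-zero used slices off the critical path before they get reused.
    off = (step % 64) * N
    return pool[off:off + N].view(1, N)

# Compare per-call torch.zeros (extra memset launch every step) vs. the pre-zeroed pool.
for name, fn in [("torch.zeros", lambda s: torch.zeros(1, N, device="cuda")),
                 ("pre-allocated pool", out_from_pool)]:
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for s in range(1000):
        out = fn(s)
    end.record()
    torch.cuda.synchronize()
    print(f"{name}: {start.elapsed_time(end):.2f} ms for 1000 calls")
```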

mobicham (@mobicham)'s Twitter Profile Photo

Damn. Luckily, we have HQQ, which takes only 5 secs to quantize an 8B model; super useful for getting started right away with any model.

Mobius Labs (@mobius_labs)'s Twitter Profile Photo

FP4 weights meet high accuracy: logit-distillation bias correction for MXFP4 & NVFP4. On Llama-3.1-8B it recovers ≥99% relative quality. Details at: mobiusml.github.io/fp4_blogpost/
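
Loosely sketched, the general shape of logit-distillation bias correction (this is not the blog post's exact recipe; the model names, loader, and hyperparameters below are placeholders, and it assumes the linear layers have, or have been given, bias terms): keep the FP4 weights frozen and train only the bias vectors so the quantized model's logits match the full-precision teacher under a KL loss.

```python
import torch
import torch.nn.functional as F

def distill_bias_correction(teacher, student, calib_loader, steps=100, lr=1e-3, T=1.0):
    """Hypothetical sketch: `student` is the FP4-quantized model, `teacher` the
    full-precision one, `calib_loader` yields tokenized calibration batches."""
    teacher.eval()

    # Freeze everything except bias terms.
    params = []
    for name, p in student.named_parameters():
        p.requires_grad_(name.endswith(".bias"))
        if p.requires_grad:
            params.append(p)
    opt = torch.optim.Adam(params, lr=lr)

    for _, batch in zip(range(steps), calib_loader):
        with torch.no_grad():
            t_logits = teacher(**batch).logits
        s_logits = student(**batch).logits
        # KL divergence between quantized and full-precision token distributions.
        loss = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                        F.softmax(t_logits / T, dim=-1),
                        reduction="batchmean") * (T * T)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student
```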

mobicham (@mobicham)'s Twitter Profile Photo

Here's how to write a fast MXFP4/NVFP4 dequant GEMV kernel for batch-size=1:
- Use a mapping + tl.gather to map the quant matrix block indices into double the fp4 range
- Load the scales as e8m0.view(uint8) and use tl.exp2 to convert

~13% faster than SPLIT-K with tl.dot_scaled
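
A hedged sketch of the dequantization side, simplified relative to the tweet: it uses a small lookup table in global memory via tl.load instead of an in-register tl.gather, skips the SPLIT-K GEMV part, and assumes two fp4 codes per uint8 (low nibble first) with one e8m0 scale per GROUP_SIZE output elements.

```python
import triton
import triton.language as tl

@triton.jit
def fp4_dequant_kernel(q_ptr, scale_ptr, lut_ptr, out_ptr, n_packed,
                       GROUP_SIZE: tl.constexpr, BLOCK: tl.constexpr):
    # Launch with grid = (cdiv(n_packed, BLOCK),); lut_ptr points to the 16 fp4 values.
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_packed

    packed = tl.load(q_ptr + offs, mask=mask, other=0)        # uint8, two fp4 codes each
    lo = (packed & 0x0F).to(tl.int32)
    hi = (packed >> 4).to(tl.int32)

    # Map the 4-bit codes to their fp values through a 16-entry lookup table.
    v_lo = tl.load(lut_ptr + lo)
    v_hi = tl.load(lut_ptr + hi)

    # e8m0 scales are stored as uint8 exponents: value = 2^(exponent - 127), so a
    # single tl.exp2 converts them without any fp8 bit tricks.
    g_lo = (offs * 2) // GROUP_SIZE
    g_hi = (offs * 2 + 1) // GROUP_SIZE
    s_lo = tl.exp2(tl.load(scale_ptr + g_lo, mask=mask, other=127).to(tl.float32) - 127.0)
    s_hi = tl.exp2(tl.load(scale_ptr + g_hi, mask=mask, other=127).to(tl.float32) - 127.0)

    tl.store(out_ptr + offs * 2, v_lo * s_lo, mask=mask)
    tl.store(out_ptr + offs * 2 + 1, v_hi * s_hi, mask=mask)
```
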
mobicham (@mobicham)'s Twitter Profile Photo

Simple and fast MXFP8 activation quant kernel:
✓ Padding-aware for arbitrary seq lens  
✓ SM-aware unrolling to improve occupancy (large batches)  
✓ ~10% faster than torch.compile in end-to-end inference
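
For reference, a hedged sketch of a padding-aware, per-row MXFP8 activation quant kernel (e4m3 values with e8m0 group scales). It omits the SM-aware unrolling mentioned above, and the power-of-two scale rounding below is one reasonable convention, not necessarily the one used in the benchmarked kernel.

```python
import triton
import triton.language as tl

@triton.jit
def mxfp8_activation_quant(x_ptr, q_ptr, scale_ptr, n_cols,
                           BLOCK: tl.constexpr, GROUP: tl.constexpr):
    # One program per row; launch with grid = (n_rows,) and BLOCK >= n_cols (power of 2).
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols                                 # padding-aware: any seq length works
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)

    # Per-group absmax -> shared power-of-two (e8m0) scale, chosen so values fit in e4m3.
    xg = tl.reshape(x, (BLOCK // GROUP, GROUP))
    amax = tl.max(tl.abs(xg), axis=1)
    exp = tl.ceil(tl.log2(tl.maximum(amax, 1e-30) / 448.0))  # 448 = max normal e4m3 value
    scale = tl.exp2(exp)

    # Quantize to e4m3; store the scales as biased e8m0 exponents (uint8).
    q = (xg / scale[:, None]).to(tl.float8e4nv)
    tl.store(q_ptr + row * n_cols + cols, tl.reshape(q, (BLOCK,)), mask=mask)
    tl.store(scale_ptr + row * (BLOCK // GROUP) + tl.arange(0, BLOCK // GROUP),
             tl.maximum(exp + 127.0, 0.0).to(tl.uint8))
```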