Mobius Labs (@mobius_labs)'s Twitter Profile
Mobius Labs

@mobius_labs

Multimodal AI at the world's scale.
Proponents of Open Source and Open Intelligence.
mobiusml.github.io/blog/ for some of our recent work.

ID: 982264845591494660

Website: http://www.mobiuslabs.com · Joined: 06-04-2018 14:32:43

385 Tweets

3.3K Followers

200 Following

mobicham (@mobicham)'s Twitter Profile Photo

So I ran an evaluation of Gemma 3 12B QAT vs. HQQ. 

HQQ takes a few seconds to quantize the model and outperforms the QAT version (AWQ format) while using a higher group-size. 

With GemLite bfloat16 support, you can run quantized Gemma 3 faster without performance issues 🫡
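
For reference, a minimal sketch of quantizing a model with HQQ through the transformers integration; the model id and the 4-bit / group-size-64 settings are placeholders, not necessarily the configuration used in the comparison above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

# Hypothetical settings: 4-bit HQQ with group_size=64 (the tweet compares against a
# larger group-size than the QAT/AWQ baseline, but the exact value is not given here).
model_id = "google/gemma-3-12b-it"
quant_config = HqqConfig(nbits=4, group_size=64)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    quantization_config=quant_config,  # HQQ quantizes on the fly, in seconds
)
```
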
mobicham (@mobicham)'s Twitter Profile Photo

Well-optimized Triton kernels can perform very well end-2-end, even competing with highly optimized kernels like Marlin.
mobicham (@mobicham)'s Twitter Profile Photo

GemLite 0.4.7 is out 🔥. It boosts performance by 5-10 tokens/sec end-2-end by using an interesting trick which might seem weird at first: sample the outputs from pre-allocated zeros. Explanation below 🧵 github.com/mobiusml/gemli…
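
A rough illustration of why this helps (this is not GemLite's actual implementation, just a hypothetical sketch): atomic-add / split-K style kernels need a zero-initialized output, and calling torch.zeros on every decode step pays for an extra memset kernel launch, whereas handing out views of a buffer that is already zero does not.

```python
import torch

N = 8192
# Pre-allocated zeros, created once at init time (hypothetical pool size).
pool = torch.zeros(64 * N, device="cuda")

def out_from_pool(step: int) -> torch.Tensor:
    # Carve a zero-initialized output view out of the pool; the caller would have to
    # re-zero used slices off the critical path before they get reused.
    off = (step % 64) * N
    return pool[off:off + N].view(1, N)

# Compare per-call torch.zeros (extra memset launch every step) vs. the pre-zeroed pool.
for name, fn in [("torch.zeros", lambda s: torch.zeros(1, N, device="cuda")),
                 ("pre-allocated pool", out_from_pool)]:
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for s in range(1000):
        out = fn(s)
    end.record()
    torch.cuda.synchronize()
    print(f"{name}: {start.elapsed_time(end):.2f} ms for 1000 calls")
```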

mobicham (@mobicham)'s Twitter Profile Photo

Damn. Luckily, we have HQQ, which takes only 5 secs to quantize an 8B model; super useful for getting started right away with any model.

Mobius Labs (@mobius_labs)'s Twitter Profile Photo

FP4 weights meet high accuracy: logit-distillation bias correction for MXFP4 & NVFP4. On Llama-3.1-8B it recovers ≥99% relative quality. Details at: mobiusml.github.io/fp4_blogpost/
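
Loosely sketched, the general shape of logit-distillation bias correction (this is not the blog post's exact recipe; the model names, loader, and hyperparameters below are placeholders, and it assumes the linear layers have, or have been given, bias terms): keep the FP4 weights frozen and train only the bias vectors so the quantized model's logits match the full-precision teacher under a KL loss.

```python
import torch
import torch.nn.functional as F

def distill_bias_correction(teacher, student, calib_loader, steps=100, lr=1e-3, T=1.0):
    """Hypothetical sketch: `student` is the FP4-quantized model, `teacher` the
    full-precision one, `calib_loader` yields tokenized calibration batches."""
    teacher.eval()

    # Freeze everything except bias terms.
    params = []
    for name, p in student.named_parameters():
        p.requires_grad_(name.endswith(".bias"))
        if p.requires_grad:
            params.append(p)
    opt = torch.optim.Adam(params, lr=lr)

    for _, batch in zip(range(steps), calib_loader):
        with torch.no_grad():
            t_logits = teacher(**batch).logits
        s_logits = student(**batch).logits
        # KL divergence between quantized and full-precision token distributions.
        loss = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                        F.softmax(t_logits / T, dim=-1),
                        reduction="batchmean") * (T * T)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student
```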

mobicham (@mobicham)'s Twitter Profile Photo

Here's how to write a fast MXFP4/NVFP4 dequant GEMV kernel for batch-size=1:
- Use a mapping + tl.gather to map the quant matrix block indices into double the fp4 range
- Load the scales as e8m0.view(uint8) and use tl.exp2 to convert

~13% faster than SPLIT-K with tl.dot_scaled
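
A hedged sketch of the dequantization side, simplified relative to the tweet: it uses a small lookup table in global memory via tl.load instead of an in-register tl.gather, skips the SPLIT-K GEMV part, and assumes two fp4 codes per uint8 (low nibble first) with one e8m0 scale per GROUP_SIZE output elements.

```python
import triton
import triton.language as tl

@triton.jit
def fp4_dequant_kernel(q_ptr, scale_ptr, lut_ptr, out_ptr, n_packed,
                       GROUP_SIZE: tl.constexpr, BLOCK: tl.constexpr):
    # Launch with grid = (cdiv(n_packed, BLOCK),); lut_ptr points to the 16 fp4 values.
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_packed

    packed = tl.load(q_ptr + offs, mask=mask, other=0)        # uint8, two fp4 codes each
    lo = (packed & 0x0F).to(tl.int32)
    hi = (packed >> 4).to(tl.int32)

    # Map the 4-bit codes to their fp values through a 16-entry lookup table.
    v_lo = tl.load(lut_ptr + lo)
    v_hi = tl.load(lut_ptr + hi)

    # e8m0 scales are stored as uint8 exponents: value = 2^(exponent - 127), so a
    # single tl.exp2 converts them without any fp8 bit tricks.
    g_lo = (offs * 2) // GROUP_SIZE
    g_hi = (offs * 2 + 1) // GROUP_SIZE
    s_lo = tl.exp2(tl.load(scale_ptr + g_lo, mask=mask, other=127).to(tl.float32) - 127.0)
    s_hi = tl.exp2(tl.load(scale_ptr + g_hi, mask=mask, other=127).to(tl.float32) - 127.0)

    tl.store(out_ptr + offs * 2, v_lo * s_lo, mask=mask)
    tl.store(out_ptr + offs * 2 + 1, v_hi * s_hi, mask=mask)
```
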
mobicham (@mobicham)'s Twitter Profile Photo

Simple and fast MXFP8 activation quant kernel:
✓ Padding-aware for arbitrary seq lens  
✓ SM-aware unrolling to improve occupancy (large batches)  
✓ ~10% faster than torch.compile in end-to-end inference
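
For reference, a hedged sketch of a padding-aware, per-row MXFP8 activation quant kernel (e4m3 values with e8m0 group scales). It omits the SM-aware unrolling mentioned above, and the power-of-two scale rounding below is one reasonable convention, not necessarily the one used in the benchmarked kernel.

```python
import triton
import triton.language as tl

@triton.jit
def mxfp8_activation_quant(x_ptr, q_ptr, scale_ptr, n_cols,
                           BLOCK: tl.constexpr, GROUP: tl.constexpr):
    # One program per row; launch with grid = (n_rows,) and BLOCK >= n_cols (power of 2).
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols                                 # padding-aware: any seq length works
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)

    # Per-group absmax -> shared power-of-two (e8m0) scale, chosen so values fit in e4m3.
    xg = tl.reshape(x, (BLOCK // GROUP, GROUP))
    amax = tl.max(tl.abs(xg), axis=1)
    exp = tl.ceil(tl.log2(tl.maximum(amax, 1e-30) / 448.0))  # 448 = max normal e4m3 value
    scale = tl.exp2(exp)

    # Quantize to e4m3; store the scales as biased e8m0 exponents (uint8).
    q = (xg / scale[:, None]).to(tl.float8e4nv)
    tl.store(q_ptr + row * n_cols + cols, tl.reshape(q, (BLOCK,)), mask=mask)
    tl.store(scale_ptr + row * (BLOCK // GROUP) + tl.arange(0, BLOCK // GROUP),
             tl.maximum(exp + 127.0, 0.0).to(tl.uint8))
```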