Brian Keene (@bpkeene)'s Twitter Profile
Brian Keene

@bpkeene

Technical Staff @ @argmaxinc | former Apple ML Engineer working on on-device inference

ID: 4717525021

Link: https://www.linkedin.com/in/brian-keene-3b7712a2/ | Joined: 06-01-2016 07:45:18

42 Tweets

178 Followers

209 Following

argmax (@argmaxinc):

Here is the compounded speedup when considering the `qmv` improvements on top. Note that speedups dramatically improve for short sequence lengths:

Awni Hannun (@awnihannun):

LLMs are faster and more memory efficient in MLX!

- All quantized models 30%+ faster h/t <a href="/angeloskath/">Angelos Katharopoulos</a> 
- Fused attention for longer context can be 2x+ faster and use way less memory h/t <a href="/bpkeene/">Brian Keene</a> <a href="/atiorh/">Atila</a> <a href="/argmaxinc/">argmax</a>

Some tokens-per-second benchmarks for 7B Mistral:
clem 🤗 (@clementdelangue):

Love how Apple is advocating for on-device AI at WWDC. Local, smaller, specialized models are the future of private, secure, and efficient AI.

Awni Hannun (@awnihannun):

SD3 runs locally with MLX thanks to the incredible work from argmax.

Super easy setup, docs here: github.com/argmaxinc/Diff…

Takes < 30 seconds to generate an image on my M1 Max:

INIYSA (@lafaiel):

This is crazy. According to Qualcomm, the X Elite runs Whisper Base-En at 72 tok/s (13.8 ms), while the A17 runs it at 237 tok/s. Properly optimized hw & sw really matter.

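Taking the tweet's figures at face value, the per-token latency and the relative speedup can be sanity-checked in a few lines (the variable names below are just for illustration):

```python
# Sanity-check the quoted Whisper Base-En throughput numbers.
x_elite_tps = 72    # Snapdragon X Elite tokens/sec, per Qualcomm's figure
a17_tps = 237       # Apple A17 tokens/sec

per_token_ms = 1000 / x_elite_tps   # per-token latency on the X Elite
speedup = a17_tps / x_elite_tps     # A17 relative to X Elite

print(f"X Elite per-token latency: {per_token_ms:.1f} ms")  # ~13.9 ms, close to the quoted 13.8 ms
print(f"A17 speedup: {speedup:.2f}x")                       # ~3.29x
```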
argmax (@argmaxinc):

FLUX.1-schnell on DiffusionKit with MLX
Video demo in thread: an M3 Max MacBook generating this 768x1360 image with bfloat16 weights in 39 seconds. Further optimizations in flux.

Install: pip install diffusionkit==0.3.0
Repo: github.com/argmaxinc/Diff…
Awni Hannun (@awnihannun):

Flux Schnell in the latest DiffusionKit with MLX is 30% faster and uses less RAM!

pip install -U diffusionkit

Generating some high quality images in < a minute on my 32GB M1 Max laptop:

Awni Hannun (@awnihannun):

Generating images with 4-bit Flux Schnell on my M1 Max laptop is pretty awesome. Less than 30 seconds, model loading and all, and uses ~5GB peak RAM. Check out DiffusionKit + MLX: github.com/argmaxinc/Diff…

Awni Hannun (@awnihannun):

Generating images with DiffusionKit + Flux Schnell is much faster in the latest MLX

On an M2 Ultra down to less than 9 seconds from close to 13 before.

Docs here: github.com/argmaxinc/Diff…
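Taking the quoted times at face value (close to 13 s before, under 9 s now), the implied speedup works out as follows; the numbers are approximate since the tweet only gives rough figures:

```python
# Relative speedup implied by the quoted generation times on an M2 Ultra.
before_s = 13   # "close to 13" seconds with the previous MLX
after_s = 9     # "less than 9" seconds with the latest MLX

speedup = before_s / after_s          # ~1.44x
time_saved = 1 - after_s / before_s   # ~31% less wall-clock time

print(f"Speedup: {speedup:.2f}x ({time_saved:.0%} less time)")
```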
argmax (@argmaxinc):

WhisperKit-0.9 is out!

- Faster Large v3 Turbo on Mac and iPhone
- Fast Model Load on TestFlight App (Experimental)
- Memory reduction for large input handling, contributed by Kosta Eleftheriou

TestFlight: testflight.apple.com/join/LPVOyJZW
GitHub (MIT): github.com/argmaxinc/Whis…

New models on

argmax (@argmaxinc):

WhisperKit on Android

In collaboration with Qualcomm, WhisperKit is growing from Apple platforms to Android! Samsung Galaxy S24 running at 300 tok/s. Links in 🧵

argmax (@argmaxinc):

WhisperKit Benchmarks are live on Hugging Face!

Speech-to-text systems are hard to benchmark holistically given trade-offs across latency, memory, energy efficiency, and accuracy. On-device testing makes it doubly challenging.

Here is our first version, built with Gradio 🧵

argmax (@argmaxinc):

We raised $8M and are thrilled to have <a href="/SalesforceVC/">Salesforce Ventures</a>  <a href="/generalcatalyst/">General Catalyst</a> <a href="/julien_c/">Julien Chaumond</a> <a href="/amasad/">Amjad Masad</a> <a href="/pirroh/">Michele Catasta</a> and other industry leader angels join us as investors.

We are hiring across all positions! Our thoughts and job application links here: argmaxinc.com/blog/seed
argmax (@argmaxinc):

Introducing WhisperKit Pro & SpeakerKit Pro

We have built major performance and feature set upgrades to WhisperKit! We are calling it WhisperKit Pro, our fastest and most comprehensive on-device speech AI offering yet. SpeakerKit Pro is our new on-device inference framework for

argmax (@argmaxinc):

WhisperKit Android is now in Beta!

WhisperKit is open for business across Android and Apple platforms. Links to code and benchmarks are below in the thread.

argmax (@argmaxinc):

Introducing SpeakerKit

State-of-the-art on-device speaker diarization:
- 10 minutes of audio processed in 3 seconds
- 10 megabytes in total
- 6-year-old devices supported

Details and links to the demo app are in the thread.
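The quoted throughput implies a very low real-time factor; a quick back-of-the-envelope check using only the numbers above:

```python
# Real-time factor (RTF) implied by the quoted SpeakerKit numbers.
audio_s = 10 * 60     # 10 minutes of audio
processing_s = 3      # processed in 3 seconds

rtf = processing_s / audio_s                # 0.005; lower is faster
realtime_multiple = audio_s / processing_s  # 200x faster than real time

print(f"RTF: {rtf:.3f} ({realtime_multiple:.0f}x faster than real time)")
```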
argmax (@argmaxinc):

Exciting SpeakerKit updates!

- Faster inference and lower error rates across 13 benchmark datasets
- Code and paper for benchmarks and system architecture are in the replies
- Ability to set the speaker count to reduce the error rate even further

argmax (@argmaxinc):

Nvidia Frontier Speech Models on Argmax SDK

Nvidia's top-ranking speech-to-text models are now seamlessly running on device with Argmax SDK, available today! Details in thread.