Guillaume Le Strat (@guillaumelst)'s Twitter Profile
Guillaume Le Strat

@guillaumelst

@zml_ai

Tech, startups, data & music

Paris - South of France

ID: 1539415124

Link: http://www.linkedin.com/in/Guillaume-Le-Strat
Joined: 22-06-2013 20:02:32

244 Tweets

350 Followers

2.2K Following

Steeve Morin 🇺🇦 (@steeve)'s Twitter Profile Photo

Since I got asked multiple times this week: here is a fully self-contained LLaMA2 built for ROCm/AMD from a MacBook with ZML. You can copy and unpack this tar on a machine _without_ ROCm installed and it will run on the GPU.

Steeve Morin 🇺🇦 (@steeve)'s Twitter Profile Photo

Hey folks, since we've been asked so many times, here is a quick demo of what we're building. This is a small LLaMA2 sharded across 1 NVIDIA RTX 4090 (in Paris), 1 AMD 6800XT (in Corendos's flat), and 1 Google Cloud TPU v2 over Tailscale. Exact same code, all built on my Mac.

Yann LeCun (@ylecun)'s Twitter Profile Photo

ZML: a high-performance AI inference stack that can parallelize and run deep learning systems on lots of different hardware. It's out of stealth, impressive, and open source.

Guillaume Le Strat (@guillaumelst)'s Twitter Profile Photo

We've been cooking with the ZML team. We just dropped a new website to showcase what we do best: fast, robust and scalable inference. Check it out 👉 zml.ai

Steeve Morin 🇺🇦 (@steeve)'s Twitter Profile Photo

The tech preview of LLMD is out:
- Easy Setup - Just mount your model and run
- Cross-Platform GPU Support - Single container works on *both* NVIDIA and AMD GPUs
- Lightweight - Only 2.4GB container size
- High Performance
Enjoy!

Erik Kaunismäki (@erikkaum)'s Twitter Profile Photo

The ZML team and Steeve Morin cooked a new high-performance LLM inference engine.
- lightweight 2.4GB container
- easy cross-platform setup
- written in Zig
I just tested it and deployed it on Hugging Face Inference Endpoints. You can try it out in 5 minutes! 👇
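For anyone wanting to poke at such a deployment, here is a minimal sketch of calling a deployed endpoint over HTTP. The endpoint URL, token, and payload shape are placeholders/assumptions (not taken from this thread); check the endpoint's own docs for the exact request format.

```python
# Hypothetical example of querying a deployed LLM inference endpoint over HTTP.
# ENDPOINT_URL and HF_TOKEN are placeholders; the JSON payload shape is an assumption.
import requests

ENDPOINT_URL = "https://your-endpoint.endpoints.huggingface.cloud"  # placeholder
HF_TOKEN = "hf_xxx"                                                 # placeholder

resp = requests.post(
    ENDPOINT_URL,
    headers={
        "Authorization": f"Bearer {HF_TOKEN}",
        "Content-Type": "application/json",
    },
    json={
        "inputs": "Explain what ZML is in one sentence.",
        "parameters": {"max_new_tokens": 64},  # assumed parameter name
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```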

Steeve Morin 🇺🇦 (@steeve)'s Twitter Profile Photo

The fastest LLM server is now faster with ROCm 6.4.1 and Tri Dao's newer flashattn. There are some cool packaging features on that image. First, it's 50 MB smaller compressed. Second, through careful layering, it overlaps download with extraction and extracts faster. The complete Docker pull takes ~20s.

Steeve Morin 🇺🇦 (@steeve)'s Twitter Profile Photo

And after one week of work, here is zml/llmd running transparently on TPU with full prefill/decode paged attention. No code change, a single flag, as it should be.
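For readers unfamiliar with the term, "paged attention" stores the KV cache in fixed-size blocks and indexes them through a per-sequence block table, so a sequence's cache does not need to live in contiguous memory. The sketch below is a generic, illustrative NumPy version of that idea, not zml/llmd's implementation; block size, pool size, and function names are all assumptions.

```python
# Illustrative (hypothetical) sketch of the paged KV-cache idea, not zml/llmd's code.
import numpy as np

BLOCK_SIZE = 16   # tokens per KV-cache block (assumed value)
NUM_BLOCKS = 64   # total physical blocks in the pool (assumed value)
HEAD_DIM = 8      # illustrative head dimension

# Physical KV pool: one preallocated buffer shared by all sequences.
k_pool = np.zeros((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM), dtype=np.float32)
free_blocks = list(range(NUM_BLOCKS))

def append_token(block_table: list[int], seq_len: int, k_vec: np.ndarray) -> int:
    """Write one token's key vector, allocating a new block when the last one is full."""
    if seq_len % BLOCK_SIZE == 0:              # current block full (or sequence empty)
        block_table.append(free_blocks.pop())
    block = block_table[seq_len // BLOCK_SIZE]
    k_pool[block, seq_len % BLOCK_SIZE] = k_vec
    return seq_len + 1

def gather_keys(block_table: list[int], seq_len: int) -> np.ndarray:
    """Reassemble the logical key sequence from scattered physical blocks."""
    ks = [k_pool[b] for b in block_table]
    return np.concatenate(ks, axis=0)[:seq_len]

# Usage: "prefill" a few tokens, then one "decode" step; blocks need not be contiguous.
table, n = [], 0
for _ in range(20):
    n = append_token(table, n, np.random.rand(HEAD_DIM).astype(np.float32))
n = append_token(table, n, np.random.rand(HEAD_DIM).astype(np.float32))
print(gather_keys(table, n).shape)  # (21, 8)
```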