Philip Kiely (@philip_kiely)'s Twitter Profile
Philip Kiely

@philip_kiely

DevRel @basetenco | Not an LLM (yet)

Author: wfsd.com & lifechangingemail.com

ID: 1070356464873758720

Link: http://philipkiely.com
Joined: 05-12-2018 16:37:22

1.1K Tweets

2.2K Followers

335 Following

Baseten (@basetenco)'s Twitter Profile Photo

We have day 0 support for #Qwen3 by Alibaba Qwen on Baseten using SGLang.

Qwen 3 235B's architecture benefits from both Tensor Parallelism and Expert Parallelism to run Attention and Sparse MoE efficiently across 4 or 8 H100 GPUs depending on quantization.  

More in 🧵
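A rough sketch of why the GPU count in the tweet above depends on quantization: the model weights alone dictate a minimum memory footprint (the 20% headroom factor is an assumption; real deployments also budget for KV cache and activations):

```python
import math

# Back-of-envelope GPU sizing for Qwen 3 235B (illustrative arithmetic only).
PARAMS_B = 235          # total parameters, in billions
H100_MEM_GB = 80        # HBM capacity per H100

def min_gpus(bytes_per_param: float, overhead: float = 1.2) -> int:
    """H100s needed just to hold the weights, with ~20% headroom assumed."""
    weights_gb = PARAMS_B * bytes_per_param
    return math.ceil(weights_gb * overhead / H100_MEM_GB)

print(min_gpus(2.0))  # BF16 (2 bytes/param): 8 GPUs
print(min_gpus(1.0))  # FP8  (1 byte/param):  4 GPUs
```

This lines up with the "4 or 8 H100 GPUs depending on quantization" figure: halving the bytes per parameter halves the weight footprint.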
Philip Kiely (@philip_kiely)'s Twitter Profile Photo

There's a lot to be excited about with Qwen 3:

- Fits on 4xH100
- 1/4 the cost of DeepSeek-R1 in production
- "Hybrid thinking" makes reasoning optional
- Continues the Qwen tradition of being great at coding

Deployment w/ SGLang + vibe check in the video!

Baseten (@basetenco)'s Twitter Profile Photo

Early benchmarks of Qwen 3 with SGLang show promising initial results and key avenues for improvement.

We're seeing:
- Up to 76 TPS per user for real-time
- Up to 4,600 total tokens per second for batch
- 32 concurrent requests as a good balance for prod

Details in 🧵
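As a back-of-envelope check on what the batch figure above buys you, a small sketch (the benchmark number is from the tweet; the flat-out utilization is an assumption):

```python
# Rough capacity math from the benchmark tweet (illustrative only).
batch_total_tps = 4600          # aggregate tokens/sec in batch mode

# Daily token budget if the deployment runs flat out in batch mode:
seconds_per_day = 24 * 60 * 60
tokens_per_day = batch_total_tps * seconds_per_day
print(f"{tokens_per_day:,} tokens/day")  # 397,440,000
```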
Elias (@eliasfiz)'s Twitter Profile Photo

People told us they want Orpheus TTS in production.

So we partnered with Baseten as our preferred inference provider!

Baseten runs Orpheus with:

• Low latency (<200 ms TTFB)
• High throughput (up to 48 real-time streams per H100)
• Secure, worldwide infra
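The 48-streams-per-H100 figure translates directly into capacity planning. A minimal sketch (the helper and its name are hypothetical, not part of any Baseten API):

```python
import math

# Capacity planning from the quoted figure: up to 48 real-time TTS streams per H100.
STREAMS_PER_H100 = 48

def gpus_needed(concurrent_streams: int) -> int:
    """H100s needed to serve a target number of concurrent real-time streams."""
    return math.ceil(concurrent_streams / STREAMS_PER_H100)

print(gpus_needed(100))  # 3
```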
zhyncs (@zhyncs42)'s Twitter Profile Photo

I’ll be joining my Baseten colleague Philip Kiely at the AI Engineer World’s Fair in San Francisco, June 3–5, to introduce LLM serving with SGLang (LMSYS Org). We’d love for you to stop by and exchange ideas in person!🤗
Philip Kiely (@philip_kiely)'s Twitter Profile Photo

Great chatting all things voice agents with kwindla today in his course!

Main takeaway: infra problems > GPU problems for voice.

1. Network overhead between client & each model
2. Client code (streaming/websockets/sessions)
3. But STT/TTS optimization w/ TRT-LLM matters too
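Point 1 above, network overhead per model hop, is usually tracked as time-to-first-byte. A minimal sketch of measuring TTFB over a streamed response, with a fake in-memory stream standing in for the real websocket (everything here is illustrative):

```python
import asyncio
import time

# Simulated streaming response; in a real voice stack this would be a
# websocket or HTTP stream from the STT/LLM/TTS services.
async def fake_token_stream():
    for token in ["Hel", "lo ", "wor", "ld"]:
        await asyncio.sleep(0.01)  # stand-in for network + inference latency
        yield token

async def measure_ttfb() -> float:
    """Time until the first chunk arrives: the metric that dominates perceived lag."""
    start = time.perf_counter()
    async for _ in fake_token_stream():
        return time.perf_counter() - start

ttfb = asyncio.run(measure_ttfb())
print(f"TTFB: {ttfb * 1000:.1f} ms")
```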
AI Engineer (@aidotengineer)'s Twitter Profile Photo

Announcing our speakers for the Voice track!

⚠️PSA: Tix nearly sold out, get em here: ti.to/software-3/ai-……

Featuring:
- kwindla, CEO, Daily
- Sean DuBois, WebRTC and Realtime API, OpenAI
- Brooke Hopkins, Founder, coval
- @dkundel, Developer Experience, OpenAI