DeepSpeed (@deepspeedai)'s Twitter Profile
DeepSpeed

@deepspeedai

Official account for DeepSpeed, a library that enables unprecedented scale and speed for deep learning training + inference.

Japanese: @DeepSpeedAI_JP

ID: 1262854060320755715

Website: https://www.deepspeed.ai/
Joined: 19-05-2020 21:14:20

81 Tweets

3.3K Followers

88 Following

Lewis Tunstall (@_lewtun):

🪁 Today we're releasing the code to train your very own Zephyr models!

We've worked hard to make this as accessible as possible, so you can run:

🏋️‍♂️ Full fine-tuning with @MSFTDeepSpeed ZeRO-3 on A100s
🐭 LoRA or QLoRA fine-tuning on consumer GPUs

Code: github.com/huggingface/al…
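
As a rough illustration of the two paths above, here is a minimal sketch (not the exact alignment-handbook recipe): a ZeRO-3 DeepSpeed config handed to a Hugging Face Trainer for full fine-tuning, plus a PEFT LoRA wrap for consumer GPUs. The model name and LoRA hyperparameters are illustrative assumptions.

    # Minimal sketch, not the alignment-handbook's exact recipe.
    from transformers import AutoModelForCausalLM, TrainingArguments
    from peft import LoraConfig, get_peft_model

    # ZeRO-3: shard parameters, gradients, and optimizer states across GPUs.
    ds_config = {
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": True,
            "stage3_gather_16bit_weights_on_model_save": True,
        },
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }
    args = TrainingArguments(output_dir="zephyr-sft", bf16=True, deepspeed=ds_config)

    # LoRA path for consumer GPUs: train small adapter matrices instead of all weights.
    model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
    model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
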
Stas Bekman (@stasbekman):

Hear, hear, AMD MI300Xs have started to emerge much sooner than expected. Here is a two-part benchmark report on BLOOM-176B inference using @MSFTDeepSpeed optimized for the AMD MI300X. 1. evp.cloud/post/diving-de… 2. evp.cloud/post/diving-de… This was published in response

Simo Ryu (@cloneofsimo):

So you've had your fun with Andrej Karpathy's minGPT. Now it's time to scale: introducing min-max-gpt, a really small codebase that scales with the help of @MSFTDeepSpeed. No Hugging Face Accelerate, no Transformers. Just DeepSpeed + torch: maximum hackability

github.com/cloneofsimo/mi…
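
The "just deepspeed + torch" claim boils down to a loop like the following. This is an illustrative sketch, not min-max-gpt's actual code; the tiny Linear model and config values are placeholders.

    import torch
    import deepspeed

    model = torch.nn.Linear(1024, 1024)   # stand-in for a GPT model
    ds_config = {
        "train_micro_batch_size_per_gpu": 4,
        "optimizer": {"type": "AdamW", "params": {"lr": 3e-4}},
        "zero_optimization": {"stage": 2},
    }

    # deepspeed.initialize wraps the model in an engine that owns the optimizer,
    # gradient handling, and ZeRO partitioning.
    engine, _, _, _ = deepspeed.initialize(model=model,
                                           model_parameters=model.parameters(),
                                           config=ds_config)

    for step in range(10):
        x = torch.randn(4, 1024, device=engine.device)
        loss = engine(x).pow(2).mean()
        engine.backward(loss)   # replaces loss.backward()
        engine.step()           # replaces optimizer.step()

Launched with the deepspeed CLI (e.g. deepspeed train.py), the same loop runs unchanged from one GPU to many.
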
DeepSpeed (@deepspeedai):

#DeepSpeed joins forces with the University of Sydney to unveil an exciting technique, #FP6. Just supply your FP16 models, and we deliver:
🚀 1.5x performance boost for #LLMs serving on #GPUs
🚀 Innovative (4+2)-bit system design
🚀 Quality-preserving quantization 
link: github.com/microsoft/Deep…
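
The "(4+2)-bit system design" refers to storing each 6-bit weight as a 4-bit part plus a 2-bit part so both streams stay aligned for fast GPU memory access. The snippet below is only a conceptual illustration of that split, not DeepSpeed's FP6 kernels.

    import numpy as np

    def split_6bit(codes):
        # codes: uint8 array of 6-bit values (0..63)
        hi4 = (codes >> 2) & 0x0F   # upper 4 bits
        lo2 = codes & 0x03          # lower 2 bits
        return hi4, lo2

    def merge_6bit(hi4, lo2):
        return (hi4 << 2) | lo2

    codes = np.array([0b101101, 0b010011], dtype=np.uint8)
    hi4, lo2 = split_6bit(codes)
    assert np.array_equal(merge_6bit(hi4, lo2), codes)
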
DeepSpeed (@deepspeedai):

Introducing Universal Checkpointing for boosting training efficiency.
- Change parallelism (PP, SP, TP, ZeRO-DP) or GPU count mid-stream
- Improve resilience by scaling down to healthy nodes💪
- Increase throughput by scaling up to elastic nodes🚀

Blog: rb.gy/aup3pn
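
The typical flow, per the DeepSpeed tutorial, is to convert a saved ZeRO checkpoint into the universal format and then resume under a new parallelism layout. The sketch below hedges on exact script flags and config keys; treat them as assumptions and check the blog for the authoritative names.

    # 1) Convert a regular ZeRO checkpoint into the universal format, e.g.:
    #      python -m deepspeed.checkpoint.ds_to_universal \
    #          --input_folder  ckpt/global_step1000 \
    #          --output_folder ckpt/global_step1000_universal
    #
    # 2) Resume on a different GPU count or parallelism layout by telling the
    #    engine to load the universal checkpoint (assumed config key):
    ds_config = {
        "checkpoint": {"load_universal": True},
        # ... the rest of the usual training config ...
    }
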
DeepSpeed (@deepspeedai):

Introducing DeepNVMe, a suite of optimizations for fast and efficient I/O operations in DL applications. 
- POSIX-style APIs
- Direct HBM/NVMe xfers via NVIDIA GDS
- Cheap Inference scaling via NVMe-Offload
 
Blog: shorturl.at/l7Oue

Microsoft Azure
NVIDIA Data Center
#FMS24
#GPUDirect
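
A hedged sketch of what the POSIX-style API looks like in practice: build the async I/O handle, kick off a read into a pinned buffer, overlap other work, then wait. Constructor arguments and method names follow the DeepNVMe tutorial but should be treated as assumptions here.

    import os, torch
    from deepspeed.ops.op_builder import AsyncIOBuilder

    aio = AsyncIOBuilder().load()
    # (block_size, queue_depth, single_submit, overlap_events, num_threads)
    handle = aio.aio_handle(1024 * 1024, 32, False, False, 1)

    path = "weights.bin"
    buf = torch.empty(os.path.getsize(path), dtype=torch.uint8).pin_memory()

    handle.async_pread(buf, path)   # start the read without blocking
    # ... overlap compute or other I/O here ...
    handle.wait()                   # block until the read completes
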
Comet (@cometml):

💡Check out Comet's latest integration with DeepSpeed, a deep learning optimization library! 🤝 With the @MSFTDeepSpeed + Comet integration, training metrics generated by DeepSpeed are logged automatically. Try the quick-start Colab to get started: colab.research.google.com/github/comet-m…
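
A hedged sketch of what enabling the integration can look like from the DeepSpeed side: a comet section in the monitoring part of the config. The key names here are assumptions; the Colab above is the authoritative quick-start.

    ds_config = {
        "train_micro_batch_size_per_gpu": 4,
        "comet": {
            "enabled": True,
            "project": "deepspeed-demo",   # hypothetical project name
        },
    }
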

DeepSpeed (@deepspeedai):

Announcing that DeepSpeed now runs natively on Windows. This unlocks DeepSpeed optimizations for Windows users and empowers more people and organizations with AI innovations.
- HF Inference & Finetuning
- LoRA
- CPU Offload

Blog: shorturl.at/a7TF8
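
For the HF inference path, the same few lines that work elsewhere now run natively on Windows. A minimal sketch, with the model choice purely illustrative:

    import torch
    import deepspeed
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "gpt2"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    # Wrap the model with DeepSpeed's inference engine, optionally swapping in
    # optimized kernels for the original modules.
    engine = deepspeed.init_inference(model, dtype=torch.float16,
                                      replace_with_kernel_inject=True)

    inputs = tok("DeepSpeed on Windows:", return_tensors="pt").to(engine.module.device)
    print(tok.decode(engine.module.generate(**inputs, max_new_tokens=20)[0]))
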
DeepSpeed (@deepspeedai):

Introducing Domino: a novel zero-communication-cost tensor parallelism (TP) training engine for both single-node and multi-node settings.

- Near-complete communication hiding
- Novel multi-node scalable TP solution 

Blog: github.com/microsoft/Deep…
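
The core idea behind the communication hiding is to launch the tensor-parallel collectives asynchronously and keep computing independent work until the result is actually needed. The snippet below illustrates that pattern with plain PyTorch collectives; it is not Domino's implementation.

    import torch
    import torch.distributed as dist

    def overlapped_step(partial_out, next_input, weight):
        work = dist.all_reduce(partial_out, async_op=True)  # start the TP reduction
        hidden = next_input @ weight                        # independent compute proceeds
        work.wait()                                         # synchronize only when needed
        return partial_out, hidden
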
DeepSpeed (@deepspeedai):

🚀Introducing Ulysses-Offload🚀

- Unlock the power of long-context LLM training and finetuning with our latest system optimizations
- Train LLaMA3-8B on a 2M-token context using 4xA100-80GB
- Achieve over 55% MFU

Blog: shorturl.at/Spx6Y
Tutorial: shorturl.at/bAWu5
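
For reference, MFU (model FLOPs utilization) is simply achieved model FLOP/s divided by the hardware's peak FLOP/s. A back-of-the-envelope check with illustrative numbers (A100-80GB dense bf16 peak is about 312 TFLOP/s):

    achieved_tflops_per_gpu = 175        # hypothetical measured value
    peak_bf16_tflops_a100   = 312        # A100-80GB dense bf16 peak
    mfu = achieved_tflops_per_gpu / peak_bf16_tflops_a100
    print(f"MFU = {mfu:.1%}")            # -> 56.1%, i.e. "over 55% MFU"
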
LF AI & Data Foundation (@lfaidatafdn):

🚀 Excited to introduce DeepSpeed, a deep learning optimization library from Microsoft! It simplifies distributed training and inference, making AI scaling more efficient and cost-effective.

Learn more 👉 hubs.la/Q0351DJC0

#DeepSpeed #AI #OpenSource #LFAIData
xr-5 🐀 (@xariusrke):

1/4⚡️nanotron now supports DoMiNo with intra-layer communication overlapping, achieving 60% communication hiding for tensor parallelism (TP) in both the forward and backward passes while maintaining the same training loss.
DeepSpeed (@deepspeedai):

AutoTP + ZeRO Training for HF Models
- Enhance HF post-training with larger models, batches, & contexts
- 4x faster LLAMA3 fine-tuning with TP=2 vs TP=1
- No code changes needed

Blog: tinyurl.com/5n8nfs2w
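
"No code changes needed" means the tensor-parallel degree is set in the DeepSpeed config rather than in the training script. A hedged sketch of such a config; the tensor_parallel / autotp_size keys follow the blog's description and are assumptions here.

    ds_config = {
        "bf16": {"enabled": True},
        "zero_optimization": {"stage": 1},       # ZeRO data parallelism
        "tensor_parallel": {"autotp_size": 2},   # assumed key: TP degree of 2
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }
    # Reuse the existing HF fine-tuning script and pass ds_config through
    # TrainingArguments(deepspeed=ds_config) as usual.
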
DeepSpeed (@deepspeedai):

Introducing 🚀DeepCompile🚀: compiler-based distributed training optimizations.
- Automatic parallelization & profile-guided optimizations
- Enable ZeRO1, ZeRO3, Offloading, etc. via compiler passes 
- 1.2X-7X speedups over manual ZeRO1/ZeRO3/Offloading

tinyurl.com/8cys28xk
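
A hedged sketch of how DeepCompile is switched on: a compile section in the DeepSpeed config, with the compiler passes applied to the initialized engine. The key names and the compile() call follow the blog's description and should be treated as assumptions.

    import deepspeed

    ds_config = {
        "zero_optimization": {"stage": 3},
        "compile": {"deepcompile": True},   # assumed key: enable compiler passes
    }

    # engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config,
    #                                        model_parameters=model.parameters())
    # engine.compile()   # apply profile-guided optimization passes before training
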
DeepSpeed (@deepspeedai):

Come hear all the exciting DeepSpeed updates at the upcoming PyTorch Day France 2025: "DeepSpeed – Efficient Training Scalability for Deep Learning Models" - sched.co/21nyy

PyTorch (@pytorch):

PyTorch Foundation has expanded into an umbrella foundation. vLLM and DeepSpeed have been accepted as hosted projects, advancing community-driven AI across the full lifecycle.

Supporting quotes provided by the following members: AMD, Arm, Amazon Web Services, Google, Huawei,
Stas Bekman (@stasbekman):

My first project at Snowflake AI Research is complete!

I present to you Arctic Long Sequence Training (ALST) 

Paper: arxiv.org/abs/2506.13996
Blog: snowflake.com/en/engineering…

ALST is a set of modular, open-source techniques that enable training on sequences up to 15 million
DeepSpeed (@deepspeedai):

Kudos to Xinyu for giving an excellent presentation of the DeepSpeed Universal Checkpointing (UCP) paper at USENIX ATC 2025.