In February, Microsoft introduced its open-source deep learning training optimization library DeepSpeed with memory optimization technology ZeRO (Zero Redundancy Optimizer), which helped build the 17-billion-parameter Turing Natural Language Generation model (T-NLG). In step with its AI at Scale initiative, Microsoft has now released four additional DeepSpeed technologies to enable even faster training times, whether on supercomputers or a single GPU.
Example 3D parallelism with 32 workers.
3D parallelism is a combination of three parallelism approaches — ZeRO-powered data parallelism (ZeRO-DP), pipeline parallelism, and tensor-slicing model parallelism — that adapts to varying workload requirements while achieving “near-perfect memory-scaling and throughput-scaling efficiency.” The new feature allows DeepSpeed to train a language model with one trillion parameters using as few as 800 NVIDIA V100 GPUs.
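To make the three-way combination concrete, the sketch below maps a flat worker rank onto coordinates in a 3D parallelism grid. The degrees chosen here (2 data-parallel replicas × 4 pipeline stages × 4 tensor slices = 32 workers, matching the figure above) and the rank ordering are illustrative assumptions, not DeepSpeed's actual topology logic.

```python
# Hypothetical mapping of 32 workers onto a 3D parallelism grid.
# The degrees (dp=2, pp=4, mp=4) are assumed for illustration only.
def rank_to_3d(rank, mp_degree=4, pp_degree=4, dp_degree=2):
    """Map a flat worker rank to (data, pipeline, model) coordinates."""
    assert 0 <= rank < mp_degree * pp_degree * dp_degree
    mp = rank % mp_degree                 # fastest-varying: tensor slice
    pp = (rank // mp_degree) % pp_degree  # next: pipeline stage
    dp = rank // (mp_degree * pp_degree)  # slowest: data-parallel replica
    return dp, pp, mp

# All 32 ranks cover the 2 x 4 x 4 grid exactly once.
coords = [rank_to_3d(r) for r in range(32)]
```

Each axis then serves a different purpose: ranks sharing the same `(pp, mp)` coordinates form a data-parallel group, while ranks sharing `dp` split the model itself across pipeline stages and tensor slices.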
The second DeepSpeed add-on, ZeRO-Offload, exploits computational and memory resources on both GPUs and their host CPUs, and will be of interest to deep learning practitioners with limited GPU resources. The key technology behind ZeRO-Offload is ZeRO-2, which offloads optimizer states and gradients onto CPU memory, enabling a single NVIDIA V100 GPU to train models of up to 13 billion parameters, 10x larger than the current state of the art.
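In practice, this kind of offloading is driven by a DeepSpeed JSON/dict configuration. The fragment below is a minimal sketch of such a config enabling ZeRO stage 2 with CPU offload; the exact option names have varied across DeepSpeed releases, so treat the keys and values here as illustrative rather than a definitive recipe.

```python
# Illustrative DeepSpeed-style config enabling ZeRO-2 with CPU offload.
# Key names and values are assumptions for this sketch; consult the
# DeepSpeed configuration docs for the options in a given release.
ds_config = {
    "train_batch_size": 16,
    "fp16": {"enabled": True},          # mixed precision to cut GPU memory
    "zero_optimization": {
        "stage": 2,                     # partition optimizer states + gradients
        "cpu_offload": True,            # keep optimizer states in CPU memory
    },
}
```

A config like this would typically be handed to DeepSpeed's engine initialization alongside the model and optimizer, letting the library manage the GPU-CPU data movement transparently.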
Architecture of ZeRO-Offload
The new Sparse Attention (SA) kernels technology meanwhile addresses the limitations …