Tackling the Communication Bottlenecks of Distributed Deep Learning Training Workloads
Permanent link to this record: http://hdl.handle.net/10754/693744
Abstract

Deep Neural Networks (DNNs) find widespread application across various domains, including computer vision, recommendation systems, and natural language processing. Despite their versatility, training DNNs can be a time-consuming process, and accommodating large models and datasets on a single machine is often impractical. To tackle these challenges, distributed deep learning (DDL) training workloads have gained increasing significance. However, DDL training introduces synchronization requirements among nodes, and the mini-batch stochastic gradient descent algorithm places a heavy burden on network connections. This dissertation proposes, analyzes, and evaluates three solutions to the communication bottleneck in DDL training workloads.

The first solution, SwitchML, introduces an in-network aggregation (INA) primitive that accelerates DDL workloads. By aggregating model updates from multiple workers inside the network, SwitchML reduces the volume of exchanged data. This approach, which co-designs switch processing with end-host protocols and deep learning frameworks, speeds up training by up to 5.5 times for real-world benchmark models.

The second solution, OmniReduce, is an efficient streaming aggregation system designed for sparse collective communication. It optimizes performance for parallel computing applications such as distributed training of large-scale recommendation systems and natural language processing models. OmniReduce maximizes effective bandwidth utilization by transmitting only nonzero data blocks and by leveraging fine-grained parallelization and pipelining. It outperforms state-of-the-art TCP/IP and RDMA network solutions by 3.5 to 16 times, delivering significantly better performance for network-bottlenecked DNNs even at 100 Gbps.

The third solution, CoInNetFlow, addresses congestion in shared data centers, where multiple DNN training jobs compete for bandwidth on the same node.
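Before turning to the scheduling study, the bandwidth-saving principle behind OmniReduce can be sketched in a few lines: gradients are split into fixed-size blocks, and all-zero blocks are never transmitted. The function names and block size below are illustrative only; the real system streams blocks through a dedicated aggregator with fine-grained pipelining.

```python
def sparse_block_reduce(gradients, block_size=256):
    """Sum worker gradients block by block, sending only nonzero blocks.

    Illustrative sketch of the nonzero-block idea (not the actual
    OmniReduce API): all-zero blocks never leave the worker, so sparse
    tensors consume far less bandwidth than a dense allreduce would.
    """
    length = len(gradients[0])
    result = [0.0] * length
    blocks_sent = 0       # blocks actually "transmitted"
    blocks_total = 0      # blocks a dense allreduce would send
    for g in gradients:
        for start in range(0, length, block_size):
            blocks_total += 1
            block = g[start:start + block_size]
            if any(block):                    # skip all-zero blocks
                blocks_sent += 1
                for i, v in enumerate(block, start):
                    result[i] += v
    return result, blocks_sent, blocks_total

# Two workers whose gradients are nonzero in disjoint, small regions
g1 = [0.0] * 1024
g1[:8] = [1.0] * 8
g2 = [0.0] * 1024
g2[512:520] = [2.0] * 8
agg, sent, total = sparse_block_reduce([g1, g2])
# only 2 of the 8 blocks cross the wire; agg holds the dense sum
```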
The study explores the feasibility of coflow scheduling methods in hierarchical, multi-tenant in-network aggregation communication patterns. CoInNetFlow presents a novel application of the Sincronia priority assignment algorithm. Through packet-level simulation of DDL jobs, the research demonstrates that appropriate weighting functions, transport-layer priority scheduling, and gradient compression on low-priority tensors can improve the median Job Completion Time Inflation by over 70%.

Collectively, this dissertation contributes to mitigating the network communication bottleneck in distributed deep learning. The proposed solutions enhance the efficiency and speed of distributed deep learning systems, ultimately improving the performance of DNN training across various domains.
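Gradient compression of the kind applied to low-priority tensors is commonly a top-k sparsifier. The minimal sketch below shows a generic top-k scheme, not necessarily the dissertation's exact method: only the k largest-magnitude entries are sent as (index, value) pairs, and the receiver rebuilds a dense tensor.

```python
import heapq

def topk_compress(tensor, k):
    """Keep only the k largest-magnitude entries of a gradient tensor.

    Generic top-k sparsification sketch, illustrating the kind of
    compression a scheduler could apply to low-priority tensors.
    """
    # indices of the k entries with the largest absolute value
    keep = heapq.nlargest(k, range(len(tensor)), key=lambda i: abs(tensor[i]))
    return sorted((i, tensor[i]) for i in keep)   # (index, value) pairs

def topk_decompress(pairs, length):
    """Rebuild a dense tensor from (index, value) pairs."""
    out = [0.0] * length
    for i, v in pairs:
        out[i] = v
    return out

grad = [0.1, -4.0, 0.05, 3.0, 0.0, -0.2]
pairs = topk_compress(grad, k=2)        # 2 of 6 values cross the wire
restored = topk_decompress(pairs, len(grad))
```

The trade-off is the usual one for compressed collectives: less traffic for low-priority jobs at the cost of a lossier gradient, which is why the study pairs compression with priority scheduling rather than applying it uniformly.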