Type
Preprint
Authors
Sapio, Amedeo
Canini, Marco
Ho, Chen-Yu
Nelson, Jacob
Kalnis, Panos
Kim, Changhoon
Krishnamurthy, Arvind
Moshref, Masoud
Ports, Dan R. K.
Richtarik, Peter
KAUST Department
Computer Science
Computer Science Program
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Extreme Computing Research Center
Date
2019-02-22
Permanent link to this record
http://hdl.handle.net/10754/653105
Summary
This record has been merged with an existing record at: http://hdl.handle.net/10754/631179.
Abstract
Training complex machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide a robust, efficient solution that speeds up training by up to 300%, and at least by 20% for a number of real-world benchmark models.
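The abstract describes the core idea at a high level: the switch sums each worker's model update chunk by chunk, so a single aggregated stream is returned instead of every worker exchanging its full update with every other. The following is a minimal, illustrative Python sketch of that aggregation pattern only; the names (Switch, all_reduce_via_switch, CHUNK_SIZE, NUM_WORKERS) are hypothetical and do not come from the SwitchML paper or its implementation.

```python
# Illustrative sketch of in-network aggregation (NOT the SwitchML code):
# an idealized "switch" holds one aggregation slot per chunk, adds each
# worker's contribution, and releases the sum once all workers have sent it.
import numpy as np

CHUNK_SIZE = 256      # elements aggregated per "packet" (hypothetical value)
NUM_WORKERS = 4
MODEL_SIZE = 1024

class Switch:
    """Toy stand-in for a programmable switch dataplane."""
    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.slots = {}  # chunk_id -> (accumulated sum, contributions seen)

    def receive(self, chunk_id, values):
        acc, seen = self.slots.get(chunk_id, (np.zeros_like(values), 0))
        acc = acc + values
        seen += 1
        self.slots[chunk_id] = (acc, seen)
        # Once every worker has contributed, the aggregated chunk is "broadcast".
        return acc if seen == self.num_workers else None

def all_reduce_via_switch(worker_updates, switch):
    """Each worker streams its update in chunks; the switch returns each
    aggregated chunk after all workers have contributed to it."""
    aggregated = np.zeros(MODEL_SIZE)
    for start in range(0, MODEL_SIZE, CHUNK_SIZE):
        chunk_id = start // CHUNK_SIZE
        for update in worker_updates:
            result = switch.receive(chunk_id, update[start:start + CHUNK_SIZE])
        aggregated[start:start + CHUNK_SIZE] = result  # last receive completes the slot
    return aggregated

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    updates = [rng.standard_normal(MODEL_SIZE) for _ in range(NUM_WORKERS)]
    agg = all_reduce_via_switch(updates, Switch(NUM_WORKERS))
    assert np.allclose(agg, sum(updates))  # switch output equals the plain sum
    print("aggregated first elements:", agg[:3])
```

In this sketch each worker's contribution is consumed at the aggregation point, so only the summed result travels back to the workers, which is the data-volume reduction the abstract refers to.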
Publisher
arXiv
arXiv
1903.06701
Additional Links
https://arxiv.org/abs/1903.06701
https://arxiv.org/pdf/1903.06701