Show simple item record

dc.contributor.author: Jin, Yuchen
dc.contributor.author: Zhou, Tianyi
dc.contributor.author: Zhao, Liangyu
dc.contributor.author: Zhu, Yibo
dc.contributor.author: Guo, Chuanxiong
dc.contributor.author: Canini, Marco
dc.contributor.author: Krishnamurthy, Arvind
dc.date.accessioned: 2022-03-07T14:06:52Z
dc.date.available: 2021-05-26T07:16:19Z
dc.date.available: 2022-03-07T14:06:52Z
dc.date.issued: 2021
dc.identifier.uri: http://hdl.handle.net/10754/669248
dc.description.abstract: Gradient compression is a widely established remedy for the communication bottleneck in distributed training of large deep neural networks (DNNs). Under the error-feedback framework, Top-k sparsification, sometimes with k as small as 0.1% of the gradient size, enables training to the same model quality as the uncompressed case for a similar iteration count. From the optimization perspective, we find that Top-k is the communication-optimal sparsifier given a per-iteration budget of k elements. We argue that to further the benefits of gradient sparsification, especially for DNNs, a different perspective is necessary, one that moves from per-iteration optimality to optimality over the entire training. We identify that the total error, the sum of the compression errors over all iterations, encapsulates sparsification throughout training. We then propose a communication complexity model that minimizes the total error under a communication budget for the entire training. We find that the hard-threshold sparsifier, a variant of the Top-k sparsifier with k determined by a constant hard threshold, is the optimal sparsifier for this model. Motivated by this, we provide convex and non-convex convergence analyses for the hard-threshold sparsifier with error feedback. We show that hard-threshold has the same asymptotic convergence and linear speedup properties as SGD in both cases and, unlike the Top-k sparsifier, is unaffected by data heterogeneity. Our diverse experiments on various DNNs and a logistic regression model demonstrate that the hard-threshold sparsifier is more communication-efficient than Top-k. Code is available at https://github.com/sands-lab/rethinking-sparsification.
dc.description.sponsorship: We would like to thank the anonymous ICLR reviewers for their valuable feedback. We would also like to thank Damien Fay for his suggestions on time series analysis. This work was partially supported by DARPA. For computer time, this research used the resources at ByteDance and the Supercomputing Laboratory at KAUST.
dc.rights: Archived with thanks to arXiv
dc.title: AutoLRS: Automatic Learning-Rate Schedule by Bayesian Optimization on the Fly
dc.type: Conference Paper
dc.contributor.department: Computer Science Program
dc.contributor.department: Computer, Electrical and Mathematical Science and Engineering (CEMSE) Division
dc.conference.date: May 04 2021
dc.conference.name: ICLR 2021
dc.conference.location: Vienna, Austria
dc.eprint.version: Publisher's Version/PDF
dc.contributor.institution: University of Washington
dc.contributor.institution: ByteDance Inc
kaust.person: Canini, Marco
refterms.dateFOA: 2021-05-26T07:16:45Z
kaust.acknowledged.supportUnit: Supercomputing Laboratory at KAUST
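
The abstract above contrasts Top-k and hard-threshold sparsification under error feedback. The sketch below is a minimal, illustrative rendering of the two compressors in PyTorch; the function names, the default threshold value, and the per-tensor usage are assumptions made for illustration and are not taken from the paper or the linked repository.

    import torch

    def hard_threshold_sparsify(grad, error, threshold=0.01):
        # Error feedback: add back the residual carried over from the previous iteration.
        corrected = grad + error
        # Keep only entries whose magnitude exceeds the fixed threshold; zero the rest.
        mask = corrected.abs() >= threshold
        sparse = torch.where(mask, corrected, torch.zeros_like(corrected))
        # Everything below the threshold becomes the new local error.
        return sparse, corrected - sparse

    def topk_sparsify(grad, error, k):
        # Top-k with error feedback, shown for comparison: a fixed budget of k entries per iteration.
        corrected = grad + error
        flat = corrected.flatten()
        _, idx = flat.abs().topk(k)
        sparse = torch.zeros_like(flat)
        sparse[idx] = flat[idx]
        sparse = sparse.view_as(corrected)
        return sparse, corrected - sparse

With a fixed threshold, the number of transmitted entries varies from iteration to iteration, which is what lets the method target the total error over the whole training run rather than a fixed per-iteration budget, as argued in the abstract.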


Files in this item

Name: autolrs.iclr21.pdf
Size: 924.6 KB
Format: PDF
Description: Publisher's Version/PDF
