Show simple item record

dc.contributor.author: Beznosikov, Aleksandr
dc.contributor.author: Horvath, Samuel
dc.contributor.author: Richtarik, Peter
dc.contributor.author: Safaryan, Mher
dc.date.accessioned: 2020-03-24T10:41:04Z
dc.date.available: 2020-03-24T10:41:04Z
dc.date.issued: 2020-02-27
dc.identifier.uri: http://hdl.handle.net/10754/662280
dc.description.abstract: In the last few years, various communication compression techniques have emerged as an indispensable tool for alleviating the communication bottleneck in distributed learning. However, despite the fact that {\em biased} compressors often show superior performance in practice compared to the much more studied and understood {\em unbiased} compressors, very little is known about them. In this work we study three classes of biased compression operators, two of which are new, and their performance when applied to (stochastic) gradient descent and distributed (stochastic) gradient descent. We show for the first time that biased compressors can lead to linear convergence rates both in the single-node and distributed settings. Our {\em distributed} SGD method enjoys the ergodic rate $\mathcal{O}\left(\frac{\delta L \exp(-K)}{\mu} + \frac{C + D}{K\mu}\right)$, where $\delta$ is a compression parameter which grows when more compression is applied, $L$ and $\mu$ are the smoothness and strong convexity constants, $C$ captures stochastic gradient noise ($C=0$ if full gradients are computed on each node), and $D$ captures the variance of the gradients at the optimum ($D=0$ for over-parameterized models). Further, via a theoretical study of several synthetic and empirical distributions of communicated gradients, we shed light on why and by how much biased compressors outperform their unbiased variants. Finally, we propose a new, highly performing biased compressor---a combination of Top-$k$ and natural dithering---which in our experiments outperforms all other compression techniques.
dc.publisher: arXiv
dc.relation.url: https://arxiv.org/pdf/2002.12410
dc.rights: Archived with thanks to arXiv
dc.title: On Biased Compression for Distributed Learning
dc.type: Preprint
dc.contributor.department: Computer Science Program
dc.contributor.department: Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
dc.contributor.department: Statistics
dc.contributor.department: Statistics Program
dc.eprint.version: Pre-print
dc.identifier.arxivid: 2002.12410
kaust.person: Beznosikov, Aleksandr
kaust.person: Horvath, Samuel
kaust.person: Richtarik, Peter
kaust.person: Safaryan, Mher
refterms.dateFOA: 2020-03-24T10:41:56Z
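
For illustration only, and not part of the repository record: the abstract refers to the biased Top-$k$ sparsifier, which keeps the $k$ largest-magnitude coordinates of a vector and zeroes the rest. A minimal NumPy sketch of that operator (function and variable names here are illustrative, not taken from the paper) could look like this:

import numpy as np

def top_k(x: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    out = np.zeros_like(x)
    if k <= 0:
        return out
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

# Example: compress a gradient-like vector, keeping 2 of its 5 coordinates.
g = np.array([0.1, -3.0, 0.5, 2.0, -0.2])
print(top_k(g, 2))  # only the entries -3.0 and 2.0 survive

The compressor proposed in the paper composes this kind of Top-$k$ sparsification with natural dithering, as stated in the abstract above.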


Files in this item

Name: Preprintfile1.pdf
Size: 1.581 MB
Format: PDF
Description: Pre-print
