## Search

Now showing items 1-9 of 9


SEGA: Variance Reduction via Gradient Sketching

Hanzely, Filip; Mishchenko, Konstantin; Richtarik, Peter (arXiv, 2018-09-09) [Preprint]

We propose a randomized first-order optimization method, SEGA (SkEtched GrAdient method), which progressively throughout its iterations builds a variance-reduced estimate of the gradient from random linear measurements (sketches) of the gradient obtained from an oracle. In each iteration, SEGA updates the current estimate of the gradient through a sketch-and-project operation using the information provided by the latest sketch, and this is subsequently used to compute an unbiased estimate of the true gradient through a random relaxation procedure. This unbiased estimate is then used to perform a gradient step. Unlike standard subspace descent methods, such as coordinate descent, SEGA can be used for optimization problems with a non-separable proximal term. We provide a general convergence analysis and prove linear convergence for strongly convex objectives. In the special case of coordinate sketches, SEGA can be enhanced with various techniques such as importance sampling, minibatching and acceleration, and its rate is up to a small constant factor identical to the best-known rate of coordinate descent.
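The three steps named in the abstract (sketch-and-project update, unbiased estimate, gradient step) can be illustrated with a minimal sketch for the special case of uniform coordinate sketches. The function name `sega_coordinate`, the oracle signature `partial(x, i)`, the optional `prox(x, step)` argument and the step size are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def sega_coordinate(partial, x0, step, iters, prox=None, seed=0):
    """Illustrative SEGA loop with uniform coordinate sketches (assumed setup).

    partial(x, i) is a hypothetical oracle returning the i-th partial derivative.
    prox(x, step) is an optional, hypothetical proximal operator.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    n = x.size
    h = np.zeros(n)  # running estimate of the gradient
    for _ in range(iters):
        i = rng.integers(n)           # sample one coordinate uniformly
        diff = partial(x, i) - h[i]   # new information from the sketch
        g = h.copy()
        g[i] += n * diff              # unbiased: E[g] = h + (grad - h) = grad
        h[i] += diff                  # sketch-and-project: overwrite coordinate i
        x = x - step * g              # gradient step with the estimate
        if prox is not None:
            x = prox(x, step)
    return x

# Usage on a small strongly convex quadratic 0.5 * x^T A x - b^T x.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 1.0])
x_hat = sega_coordinate(lambda x, i: A[i] @ x - b[i], np.zeros(2), step=0.02, iters=20000)
```

Because `h` converges to the gradient at the solution, the variance of `g` vanishes, which is what allows a constant step size here.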

Accelerated Coordinate Descent with Arbitrary Sampling and Best Rates for Minibatches

Hanzely, Filip; Richtarik, Peter (arXiv, 2018-09-25) [Preprint]

Accelerated coordinate descent is a widely popular optimization algorithm due to its efficiency on large-dimensional problems. It achieves state-of-the-art complexity on an important class of empirical risk minimization problems. In this paper we design and analyze an accelerated coordinate descent (ACD) method which in each iteration updates a random subset of coordinates according to an arbitrary but fixed probability law, which is a parameter of the method. If all coordinates are updated in each iteration, our method reduces to the classical accelerated gradient descent method AGD of Nesterov. If a single coordinate is updated in each iteration, and we pick probabilities proportional to the square roots of the coordinate-wise Lipschitz constants, our method reduces to the currently fastest coordinate descent method NUACDM of Allen-Zhu, Qu, Richtárik and Yuan. While mini-batch variants of ACD are more popular and relevant in practice, there is no importance sampling for ACD that outperforms the standard uniform mini-batch sampling. Through insights enabled by our general analysis, we design new importance sampling for mini-batch ACD which significantly outperforms previous state-of-the-art minibatch ACD in practice. We prove a rate that is at most ${\cal O}(\sqrt{\tau})$ times worse than the rate of minibatch ACD with uniform sampling, but can be ${\cal O}(n/\tau)$ times better, where $\tau$ is the minibatch size. Since in modern supervised learning training systems it is standard practice to choose $\tau \ll n$, and often $\tau={\cal O}(1)$, our method can lead to dramatic speedups. Lastly, we obtain similar results for minibatch nonaccelerated CD as well, achieving improvements on previous best rates.
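As a small illustration of the single-coordinate case mentioned in the abstract, the sampling probabilities proportional to the square roots of the coordinate-wise Lipschitz constants can be computed as follows. The function name and the example constants are hypothetical; this covers only the sampling law, not the ACD method itself.

```python
import numpy as np

def sqrt_lipschitz_probabilities(lipschitz):
    """Return p with p_i proportional to sqrt(L_i) (illustrative helper)."""
    weights = np.sqrt(np.asarray(lipschitz, dtype=float))
    return weights / weights.sum()

# Hypothetical coordinate-wise Lipschitz constants.
p = sqrt_lipschitz_probabilities([1.0, 4.0, 9.0])

# Draw one coordinate index according to p.
rng = np.random.default_rng(0)
i = rng.choice(len(p), p=p)
```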

One Method to Rule Them All: Variance Reduction for Data, Parameters and Many New Methods

Hanzely, Filip; Richtarik, Peter (arXiv, 2019-05-27) [Preprint]

We propose a remarkably general variance-reduced method suitable for solving regularized empirical risk minimization problems with either a large number of training examples, or a large model dimension, or both. In special cases, our method reduces to several known and previously thought to be unrelated methods, such as {\tt SAGA}, {\tt LSVRG}, {\tt JacSketch}, {\tt SEGA} and {\tt ISEGA}, and their arbitrary sampling and proximal generalizations. However, we also highlight a large number of new specific algorithms with interesting properties. We provide a single theorem establishing linear convergence of the method under smoothness and quasi strong convexity assumptions. With this theorem we recover best-known and sometimes improved rates for known methods arising in special cases. As a by-product, we provide the first unified method and theory for stochastic gradient and stochastic coordinate descent type methods.

A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent

Gorbunov, Eduard; Hanzely, Filip; Richtarik, Peter (arXiv, 2019-05-27) [Preprint]

In this paper we introduce a unified analysis of a large family of variants of proximal stochastic gradient descent ({\tt SGD}) which so far have required different intuitions, convergence analyses, have different applications, and which have been developed separately in various communities. We show that our framework includes methods with and without the following tricks, and their combinations: variance reduction, importance sampling, mini-batch sampling, quantization, and coordinate sub-sampling. As a by-product, we obtain the first unified theory of {\tt SGD} and randomized coordinate descent ({\tt RCD}) methods, the first unified theory of variance reduced and non-variance-reduced {\tt SGD} methods, and the first unified theory of quantized and non-quantized methods. A key to our approach is a parametric assumption on the iterates and stochastic gradients. In a single theorem we establish a linear convergence result under this assumption and strong-quasi convexity of the loss function. Whenever we recover an existing method as a special case, our theorem gives the best known complexity result. Our approach can be used to motivate the development of new useful methods, and offers pre-proved convergence guarantees. To illustrate the strength of our approach, we develop five new variants of {\tt SGD}, and through numerical experiments demonstrate some of their properties.

99% of Distributed Optimization is a Waste of Time: The Issue and How to Fix it

Mishchenko, Konstantin; Hanzely, Filip; Richtarik, Peter (arXiv, 2019-06-04) [Preprint]

It is well known that many optimization methods, including SGD, SAGA, and Accelerated SGD for over-parameterized models, do not scale linearly in the parallel setting. In this paper, we present a new version of block coordinate descent that solves this issue for a number of methods. The core idea is to make the sampling of coordinate blocks on each parallel unit independent of the others. Surprisingly, we prove that the optimal number of blocks to be updated by each of $n$ units in every iteration is equal to $m/n$, where $m$ is the total number of blocks. As an illustration, this means that when $n=100$ parallel units are used, $99\%$ of work is a waste of time. We demonstrate that with $m/n$ blocks used by each unit the iteration complexity often remains the same. Among other applications which we mention, this fact can be exploited in the setting of distributed optimization to break the communication bottleneck. Our claims are justified by numerical experiments which demonstrate almost a perfect match with our theory on a number of datasets.

Best Pair Formulation & Accelerated Scheme for Non-convex Principal Component Pursuit

Dutta, Aritra; Hanzely, Filip; Liang, Jingwei; Richtarik, Peter (arXiv, 2019-05-25) [Preprint]

The best pair problem aims to find a pair of points that minimize the distance between two disjoint sets. In this paper, we formulate the classical robust principal component analysis (RPCA) as a best pair problem, which was not considered before. We design an accelerated proximal gradient scheme to solve it, for which we show global convergence, as well as the local linear rate. Our extensive numerical experiments on both real and synthetic data suggest that the algorithm outperforms relevant baseline algorithms in the literature.

Fastest Rates for Stochastic Mirror Descent Methods

Hanzely, Filip; Richtarik, Peter (arXiv, 2018-03-20) [Preprint]

Relative smoothness - a notion introduced by Birnbaum et al. (2011) and rediscovered by Bauschke et al. (2016) and Lu et al. (2016) - generalizes the standard notion of smoothness typically used in the analysis of gradient-type methods. In this work we take ideas from the well-studied field of stochastic convex optimization and use them to obtain faster algorithms for minimizing relatively smooth functions. We propose and analyze two new algorithms: Relative Randomized Coordinate Descent (relRCD) and Relative Stochastic Gradient Descent (relSGD), both generalizing famous algorithms in the standard smooth setting. The methods we propose can in fact be seen as particular instances of stochastic mirror descent algorithms. One of them, relRCD, corresponds to the first stochastic variant of the mirror descent algorithm with a linear convergence rate.

A Privacy Preserving Randomized Gossip Algorithm via Controlled Noise Insertion

Hanzely, Filip; Konečný, Jakub; Loizou, Nicolas; Richtarik, Peter; Grishchenko, Dmitry (arXiv, 2019-01-27) [Preprint]

In this work we present a randomized gossip algorithm for solving the average consensus problem while at the same time protecting the information about the initial private values stored at the nodes. We give iteration complexity bounds for the method and perform extensive numerical experiments.
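For context, the average consensus problem can be illustrated with the standard randomized pairwise gossip scheme, sketched below. This is the plain baseline only: the controlled noise insertion that gives the paper its privacy guarantee is not described in the abstract and is deliberately left out. The function name, the ring graph and the initial values are assumptions made for illustration.

```python
import numpy as np

def randomized_gossip(values, edges, iters, seed=0):
    """Standard pairwise gossip: a random edge averages its two endpoints.

    Baseline without any privacy mechanism (illustrative only).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(values, dtype=float).copy()
    for _ in range(iters):
        i, j = edges[rng.integers(len(edges))]  # pick one edge uniformly
        x[i] = x[j] = 0.5 * (x[i] + x[j])       # both nodes move to their mean
    return x

# Hypothetical ring of four nodes with private initial values.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
final = randomized_gossip([1.0, 2.0, 3.0, 6.0], ring, iters=2000)
```

Each pairwise average preserves the sum of the values, so on a connected graph every node approaches the average of the initial values.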

Accelerated Bregman Proximal Gradient Methods for Relatively Smooth Convex Optimization

Hanzely, Filip; Richtarik, Peter; Xiao, Lin (arXiv, 2018-08-09) [Preprint]

We consider the problem of minimizing the sum of two convex functions: one is differentiable and relatively smooth with respect to a reference convex function, and the other can be nondifferentiable but simple to optimize. The relatively smooth condition is much weaker than the standard assumption of uniform Lipschitz continuity of the gradients, thus significantly increases the scope of potential applications. We present accelerated Bregman proximal gradient (ABPG) methods that employ the Bregman distance of the reference function as the proximity measure. These methods attain an $O(k^{-\gamma})$ convergence rate in the relatively smooth setting, where $\gamma\in [1, 2]$ is determined by a triangle scaling property of the Bregman distance. We develop adaptive variants of the ABPG method that automatically ensure the best possible rate of convergence and argue that the $O(k^{-2})$ rate is attainable in most cases. We present numerical experiments with three applications: D-optimal experiment design, Poisson linear inverse problem, and relative-entropy nonnegative regression. In all experiments, we obtain numerical certificates showing that these methods do converge with the $O(k^{-2})$ rate.
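For readers unfamiliar with the terms used in this abstract, the standard definitions can be written as follows. The symbols $f$, $h$, $L$ and $D_h$ are notation chosen here and are not taken from the abstract. The Bregman distance of a differentiable convex reference function $h$ is

$$
D_h(x, y) = h(x) - h(y) - \langle \nabla h(y),\, x - y \rangle ,
$$

and $f$ is called $L$-smooth relative to $h$ if

$$
f(x) \le f(y) + \langle \nabla f(y),\, x - y \rangle + L\, D_h(x, y) \quad \text{for all } x, y .
$$

With $h(x) = \tfrac{1}{2}\|x\|^2$ we get $D_h(x, y) = \tfrac{1}{2}\|x - y\|^2$, which recovers the usual smoothness inequality for gradients that are Lipschitz with constant $L$. The triangle scaling property with exponent $\gamma$ is, as we understand it, the requirement that

$$
D_h\big((1-\theta)x + \theta z,\; (1-\theta)x + \theta \tilde z\big) \le \theta^{\gamma}\, D_h(z, \tilde z) \quad \text{for all } \theta \in [0, 1] ;
$$

the paper should be consulted for the exact statement and domain conditions.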
