## Search

Now showing items 1-10 of 23



New Convergence Aspects of Stochastic Gradient Algorithms

Nguyen, Lam M.; Nguyen, Phuong Ha; Richtarik, Peter; Scheinberg, Katya; Takáč, Martin; Dijk, Marten van (arXiv, 2018-11-10) [Preprint]

The classical convergence analysis of SGD is carried out under the assumption that the norm of the stochastic gradient is uniformly bounded. While this might hold for some loss functions, it is violated for cases where the objective function is strongly convex. In Bottou et al. (2016), a new analysis of convergence of SGD is performed under the assumption that stochastic gradients are bounded with respect to the true gradient norm. We show that for stochastic problems arising in machine learning such a bound always holds; and we also propose an alternative convergence analysis of SGD with a diminishing learning rate regime, which results in more relaxed conditions than those in Bottou et al. (2016). We then move on to the asynchronous parallel setting, and prove convergence of the Hogwild! algorithm in the same regime in the case of diminishing learning rate. It is well known that SGD converges if the sequence of learning rates $\{\eta_t\}$ satisfies $\sum_{t=0}^\infty \eta_t \rightarrow \infty$ and $\sum_{t=0}^\infty \eta^2_t < \infty$. We show the convergence of SGD for a strongly convex objective function without using the bounded gradient assumption when $\{\eta_t\}$ is a diminishing sequence and $\sum_{t=0}^\infty \eta_t \rightarrow \infty$. In other words, we extend the current state-of-the-art class of learning rates satisfying the convergence of SGD.
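As an illustration of the diminishing-learning-rate regime the abstract describes, here is a minimal SGD sketch (my illustration, not the authors' code) on a strongly convex one-dimensional least-squares problem; the schedule $\eta_t = c/(t+1)$ satisfies both $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$:

```python
import random

# Minimal sketch: SGD with a diminishing learning rate eta_t = c/(t+1)
# on the strongly convex problem f(w) = (1/n) * sum_i (w - a_i)^2,
# whose minimizer is the mean of the a_i. With c = 0.5 this particular
# schedule reduces to a running average of the sampled points.
def sgd_diminishing(data, c=0.5, steps=5000, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for t in range(steps):
        a = rng.choice(data)          # sample one data point
        grad = 2.0 * (w - a)          # stochastic gradient of (w - a)^2
        eta = c / (t + 1)             # diminishing step size
        w -= eta * grad
    return w

data = [1.0, 2.0, 3.0, 4.0]
w = sgd_diminishing(data)             # converges to the mean, 2.5
```

Note that no bound on the stochastic gradient norm is assumed anywhere; the gradients grow with $|w|$, exactly the situation the paper's analysis covers.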

A Stochastic Penalty Model for Convex and Nonconvex Optimization with Big Constraints

Mishchenko, Konstantin; Richtarik, Peter (arXiv, 2018-10-31) [Preprint]

The last decade witnessed a rise in the importance of supervised learning applications involving *big data* and *big models*. Big data refers to situations where the amount of training data available and needed causes difficulties in the training phase of the pipeline. Big model refers to situations where large-dimensional and over-parameterized models are needed for the application at hand. Both of these phenomena lead to a dramatic increase in research activity aimed at taming the issues via the design of new sophisticated optimization algorithms. In this paper we turn attention to the *big constraints* scenario and argue that elaborate machine learning systems of the future will necessarily need to account for a large number of real-world constraints, which will need to be incorporated in the training process. This line of work is largely unexplored, and provides ample opportunities for future work and applications. To handle the *big constraints* regime, we propose a *stochastic penalty* formulation which *reduces the problem to the well understood big data regime*. Our formulation has many interesting properties which relate it to the original problem in various ways, with mathematical guarantees. We give a number of results specialized to nonconvex loss functions, smooth convex functions, strongly convex functions and convex constraints. We show through experiments that our approach can beat competing approaches by several orders of magnitude when a medium accuracy solution is required.
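The stochastic-penalty idea can be sketched on a toy instance (my illustration, not the paper's formulation): minimize $x^2$ subject to many constraints $x \ge b_i$ by sampling one penalized constraint per step, which reduces the constrained problem to an ordinary stochastic optimization problem:

```python
import random

# Hypothetical toy instance: minimize f(x) = x^2 subject to x >= b_i
# for each b_i, via the quadratic penalty lam * max(0, b_i - x)^2.
# Each SGD step samples a single constraint, so the "big constraints"
# problem is handled like a standard stochastic ("big data") problem.
def stochastic_penalty_sgd(bs, lam=100.0, steps=4000, seed=0):
    rng = random.Random(seed)
    x = 0.0
    for t in range(steps):
        b = rng.choice(bs)                      # sample one constraint
        viol = max(0.0, b - x)                  # constraint violation
        grad = 2.0 * x - 2.0 * lam * viol       # grad of x^2 + lam*viol^2
        eta = 0.01 / (1.0 + t / 200.0)          # diminishing step size
        x -= eta * grad
    return x

x = stochastic_penalty_sgd([0.5, 1.0])          # binding constraint: x >= 1
```

With the finite penalty weight `lam` the iterates settle slightly inside the feasible boundary (near $x \approx 0.98$ here rather than exactly $1$), the usual behavior of quadratic penalty formulations.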

Accelerated Coordinate Descent with Arbitrary Sampling and Best Rates for Minibatches

Hanzely, Filip; Richtarik, Peter (arXiv, 2018-09-25) [Preprint]

Accelerated coordinate descent is a widely popular optimization algorithm due to its efficiency on large-dimensional problems. It achieves state-of-the-art complexity on an important class of empirical risk minimization problems. In this paper we design and analyze an accelerated coordinate descent (ACD) method which in each iteration updates a random subset of coordinates according to an arbitrary but fixed probability law, which is a parameter of the method. If all coordinates are updated in each iteration, our method reduces to the classical accelerated gradient descent method AGD of Nesterov. If a single coordinate is updated in each iteration, and we pick probabilities proportional to the square roots of the coordinate-wise Lipschitz constants, our method reduces to the currently fastest coordinate descent method NUACDM of Allen-Zhu, Qu, Richtárik and Yuan. While mini-batch variants of ACD are more popular and relevant in practice, there is no importance sampling for ACD that outperforms the standard uniform mini-batch sampling. Through insights enabled by our general analysis, we design new importance sampling for mini-batch ACD which significantly outperforms previous state-of-the-art minibatch ACD in practice. We prove a rate that is at most ${\cal O}(\sqrt{\tau})$ times worse than the rate of minibatch ACD with uniform sampling, but can be ${\cal O}(n/\tau)$ times better, where $\tau$ is the minibatch size. Since in modern supervised learning training systems it is standard practice to choose $\tau \ll n$, and often $\tau={\cal O}(1)$, our method can lead to dramatic speedups. Lastly, we obtain similar results for minibatch nonaccelerated CD as well, achieving improvements on previous best rates.
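The paper's accelerated method is involved, but the core ingredient of a non-uniform sampling law is easy to see in the plain, nonaccelerated case. Below is a minimal sketch (my illustration, not the paper's ACD) of randomized coordinate descent on a 2-D quadratic, sampling coordinate $i$ with probability proportional to its coordinate-wise Lipschitz constant $L_i = A_{ii}$:

```python
import random

# Nonaccelerated randomized coordinate descent on
# f(x) = 0.5 * x^T A x - b^T x, sampling coordinate i with probability
# proportional to L_i = A[i][i] (a simple importance sampling law).
A = [[2.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]                     # solution of A x = b is x* = (0.2, 0.6)

def coord_descent(A, b, iters=500, seed=0):
    rng = random.Random(seed)
    n = len(b)
    L = [A[i][i] for i in range(n)]              # coordinate-wise Lipschitz
    x = [0.0] * n
    for _ in range(iters):
        i = rng.choices(range(n), weights=L)[0]  # importance sampling
        grad_i = sum(A[i][j] * x[j] for j in range(n)) - b[i]
        x[i] -= grad_i / L[i]                    # exact coordinate step
    return x

x = coord_descent(A, b)
```

The probability law here is a free parameter, exactly as in the abstract; swapping in weights proportional to $\sqrt{L_i}$ would mimic the sampling that recovers NUACDM in the accelerated setting.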

Nonconvex Variance Reduced Optimization with Arbitrary Sampling

Horvath, Samuel; Richtarik, Peter (arXiv, 2018-09-11) [Preprint]

We provide the first importance sampling variants of variance reduced algorithms for empirical risk minimization with non-convex loss functions. In particular, we analyze non-convex versions of SVRG, SAGA and SARAH. Our methods have the capacity to speed up the training process by an order of magnitude compared to the state of the art on real datasets. Moreover, we also improve upon current mini-batch analysis of these methods by proposing importance sampling for minibatches in this setting. Surprisingly, our approach can in some regimes lead to superlinear speedup with respect to the minibatch size, which is not usually present in stochastic optimization. All the above results follow from a general analysis of the methods which works with arbitrary sampling, i.e., a fully general randomized strategy for the selection of subsets of examples to be sampled in each iteration. Finally, we also perform a novel importance sampling analysis of SARAH in the convex setting.
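To make the importance-sampling mechanism concrete, here is a toy SVRG sketch (my illustration, on a convex problem rather than the paper's nonconvex setting) with sampling probabilities $p_i \propto L_i$; the $1/(np_i)$ factor keeps the gradient estimate unbiased while equalizing the scale of the sampled gradients:

```python
import random

# Toy problem: f(w) = (1/n) sum_i (c_i/2)(w - a_i)^2, whose minimizer is
# the c-weighted mean of the a_i. SVRG with importance sampling
# p_i ∝ c_i (the per-example Lipschitz constant); the 1/(n*p_i) factor
# makes the variance-reduced gradient estimate unbiased.
def svrg_is(c, a, eta=0.05, outer=30, inner=100, seed=0):
    rng = random.Random(seed)
    n = len(a)
    p = [ci / sum(c) for ci in c]               # importance sampling law
    w = 0.0
    for _ in range(outer):
        ws = w                                   # snapshot point
        mu = sum(ci * (ws - ai) for ci, ai in zip(c, a)) / n  # full grad
        for _ in range(inner):
            i = rng.choices(range(n), weights=p)[0]
            gi_w = c[i] * (w - a[i])
            gi_ws = c[i] * (ws - a[i])
            g = (gi_w - gi_ws) / (n * p[i]) + mu  # variance-reduced grad
            w -= eta * g
    return w

c, a = [1.0, 10.0], [0.0, 1.0]
w = svrg_is(c, a)                 # minimizer: sum(c_i*a_i)/sum(c_i) = 10/11
```

With $p_i \propto c_i$, the scaled gradient $c_i/(np_i)$ is the same constant for every example, which is exactly why importance sampling tames problems with very uneven per-example smoothness.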

SEGA: Variance Reduction via Gradient Sketching

Hanzely, Filip; Mishchenko, Konstantin; Richtarik, Peter (arXiv, 2018-09-09) [Preprint]

We propose a randomized first order optimization method -- SEGA (SkEtched GrAdient method) -- which progressively throughout its iterations builds a variance-reduced estimate of the gradient from random linear measurements (sketches) of the gradient obtained from an oracle. In each iteration, SEGA updates the current estimate of the gradient through a sketch-and-project operation using the information provided by the latest sketch, and this is subsequently used to compute an unbiased estimate of the true gradient through a random relaxation procedure. This unbiased estimate is then used to perform a gradient step. Unlike standard subspace descent methods, such as coordinate descent, SEGA can be used for optimization problems with a non-separable proximal term. We provide a general convergence analysis and prove linear convergence for strongly convex objectives. In the special case of coordinate sketches, SEGA can be enhanced with various techniques such as importance sampling, minibatching and acceleration, and its rate is up to a small constant factor identical to the best-known rate of coordinate descent.
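For coordinate sketches, the sketch-and-project and relaxation steps described above reduce to a short update. Here is a minimal sketch of that special case (my reconstruction, not the authors' code) on $f(x) = \tfrac12\|x - b\|^2$:

```python
import random

# SEGA with coordinate sketches on f(x) = 0.5*||x - b||^2, whose
# gradient is x - b. Each iteration observes one coordinate of the
# gradient, folds it into the running estimate h (sketch-and-project),
# and forms an unbiased estimate g used for the actual gradient step.
def sega_coord(b, eta=0.05, iters=4000, seed=0):
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    h = [0.0] * n                       # biased running gradient estimate
    for _ in range(iters):
        i = rng.randrange(n)
        gi = x[i] - b[i]                # one coordinate of the true gradient
        g = list(h)
        g[i] += n * (gi - h[i])         # relaxation: E[g] = grad f(x)
        h[i] = gi                       # sketch-and-project update of h
        for j in range(n):
            x[j] -= eta * g[j]          # gradient step with the estimate
    return x

target = [1.0, 2.0, 3.0]
x = sega_coord(target)
```

Unlike coordinate descent, the step here moves *all* coordinates using the estimate `g`, which is what lets full SEGA handle non-separable proximal terms.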

Accelerated Bregman Proximal Gradient Methods for Relatively Smooth Convex Optimization

Hanzely, Filip; Richtarik, Peter; Xiao, Lin (arXiv, 2018-08-09) [Preprint]

We consider the problem of minimizing the sum of two convex functions: one is differentiable and relatively smooth with respect to a reference convex function, and the other can be nondifferentiable but simple to optimize. The relatively smooth condition is much weaker than the standard assumption of uniform Lipschitz continuity of the gradients, and thus significantly increases the scope of potential applications. We present accelerated Bregman proximal gradient (ABPG) methods that employ the Bregman distance of the reference function as the proximity measure. These methods attain an $O(k^{-\gamma})$ convergence rate in the relatively smooth setting, where $\gamma\in [1, 2]$ is determined by a triangle scaling property of the Bregman distance. We develop adaptive variants of the ABPG method that automatically ensure the best possible rate of convergence and argue that the $O(k^{-2})$ rate is attainable in most cases. We present numerical experiments with three applications: D-optimal experiment design, the Poisson linear inverse problem, and relative-entropy nonnegative regression. In all experiments, we obtain numerical certificates showing that these methods do converge with the $O(k^{-2})$ rate.
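ABPG itself is involved, but its nonaccelerated building block with the negative-entropy reference function is the classical mirror-descent / exponentiated-gradient step. A minimal sketch (my illustration, on a hypothetical linear objective over the probability simplex, not one of the paper's applications):

```python
import math

# Bregman proximal gradient step with reference function
# h(x) = sum_i x_i*log(x_i) (negative entropy) on the probability
# simplex. The Bregman prox then has the closed form
# x_i <- x_i * exp(-eta * g_i), followed by renormalization.
# Toy objective: f(x) = <c, x>, minimized at the vertex with smallest c_i.
def bregman_prox_grad(c, eta=0.5, iters=50):
    n = len(c)
    x = [1.0 / n] * n                        # start at the simplex center
    for _ in range(iters):
        g = c                                # gradient of <c, x> is c
        w = [xi * math.exp(-eta * gi) for xi, gi in zip(x, g)]
        s = sum(w)
        x = [wi / s for wi in w]             # renormalize onto the simplex
    return x

x = bregman_prox_grad([1.0, 0.5, 2.0])       # mass concentrates on index 1
```

The point of the Bregman distance is visible here: the multiplicative update keeps iterates strictly inside the simplex without any Euclidean projection, even though $f$ has no uniformly Lipschitz gradient on this domain in general.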

Accelerated Gossip via Stochastic Heavy Ball Method

Loizou, Nicolas; Richtarik, Peter (arXiv, 2018-09-23) [Preprint]

In this paper we show how the stochastic heavy ball method (SHB) -- a popular method for solving stochastic convex and non-convex optimization problems -- operates as a randomized gossip algorithm. In particular, we focus on two special cases of SHB: the randomized Kaczmarz method with momentum and its block variant. Building upon a recent framework for the design and analysis of randomized gossip algorithms [Loizou and Richtarik, 2016], we interpret the distributed nature of the proposed methods. We present novel protocols for solving the average consensus problem where in each step all nodes of the network update their values but only a subset of them exchange their private values. Numerical experiments on popular wireless sensor networks showing the benefits of our protocols are also presented.
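A minimal sketch of the idea (my reconstruction, not the paper's code): on a 4-node ring, each step applies the randomized Kaczmarz projection for one constraint $x_i = x_j$ (i.e. the sampled edge averages its endpoints) plus a heavy-ball momentum term; the momentum coefficient `beta=0.2` is an arbitrary illustrative choice:

```python
import random

# Randomized Kaczmarz gossip with heavy-ball momentum for average
# consensus on a ring. The Kaczmarz projection for the constraint
# x_i = x_j moves both endpoints halfway toward each other; the term
# beta*(x - x_prev) is the heavy-ball correction. Both operations
# preserve the network sum, so consensus is reached at the initial mean.
def momentum_gossip(values, edges, beta=0.2, iters=5000, seed=0):
    rng = random.Random(seed)
    x = list(values)
    x_prev = list(values)
    for _ in range(iters):
        i, j = rng.choice(edges)            # one edge exchanges values
        x_new = list(x)
        r = (x[i] - x[j]) / 2.0             # Kaczmarz projection step
        x_new[i] -= r
        x_new[j] += r
        for k in range(len(x)):             # heavy-ball momentum term
            x_new[k] += beta * (x[k] - x_prev[k])
        x_prev, x = x, x_new
    return x

ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
x = momentum_gossip([1.0, 2.0, 3.0, 4.0], ring)  # initial average is 2.5
```

Note that in each step only the sampled pair exchanges values, yet every node updates through the momentum term, matching the protocol structure described in the abstract.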

SGD and Hogwild! Convergence Without the Bounded Gradients Assumption

Nguyen, Lam M.; Nguyen, Phuong Ha; Dijk, Marten van; Richtarik, Peter; Scheinberg, Katya; Takáč, Martin (arXiv, 2018-02-11) [Preprint]

Stochastic gradient descent (SGD) is the optimization algorithm of choice in many machine learning applications such as regularized empirical risk minimization and training deep neural networks. The classical convergence analysis of SGD is carried out under the assumption that the norm of the stochastic gradient is uniformly bounded. While this might hold for some loss functions, it is always violated for cases where the objective function is strongly convex. In (Bottou et al., 2016), a new analysis of convergence of SGD is performed under the assumption that stochastic gradients are bounded with respect to the true gradient norm. Here we show that for stochastic problems arising in machine learning such a bound always holds; and we also propose an alternative convergence analysis of SGD with a diminishing learning rate regime, which results in more relaxed conditions than those in (Bottou et al., 2016). We then move on to the asynchronous parallel setting, and prove convergence of the Hogwild! algorithm in the same regime, obtaining the first convergence results for this method in the case of diminishing learning rate.
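The Hogwild! pattern itself is simple to sketch (my illustration, not the authors' code): several workers run SGD on a shared parameter with no synchronization, tolerating the occasional lost update from racy read-modify-write cycles:

```python
import random
import threading

# Hogwild!-style lock-free SGD: each thread repeatedly samples a data
# point and updates the shared parameter w[0] without any lock. For
# f(w) = (1/n) sum_i (w - a_i)^2 the minimizer is the mean of the a_i,
# and the unsynchronized updates still hover around it.
data = [1.0, 2.0, 3.0, 4.0]          # mean (the minimizer) is 2.5
w = [0.0]                            # shared parameter, updated racily
eta = 0.01

def worker(seed, steps=5000):
    rng = random.Random(seed)
    for _ in range(steps):
        a = rng.choice(data)
        w[0] -= eta * 2.0 * (w[0] - a)   # racy read-modify-write

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the small constant step size used here the shared iterate settles into a noise ball around the minimizer; the paper's result concerns the diminishing-step regime, where convergence to the minimizer itself is proved.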

Improving SAGA via a Probabilistic Interpolation with Gradient Descent

Bibi, Adel; Sailanbayev, Alibek; Ghanem, Bernard; Gower, Robert Mansel; Richtarik, Peter (arXiv, 2018-06-14) [Preprint]

We develop and analyze a new algorithm for empirical risk minimization, which is the key paradigm for training supervised machine learning models. Our method---SAGD---is based on a probabilistic interpolation of SAGA and gradient descent (GD). In particular, in each iteration we take a gradient step with probability $q$ and a SAGA step with probability $1-q$. We show that, surprisingly, the total expected complexity of the method (which is obtained by multiplying the number of iterations by the expected number of gradients computed in each iteration) is minimized for a non-trivial probability $q$. For example, for a well conditioned problem the choice $q=1/(n-1)^2$, where $n$ is the number of data samples, gives a method with an overall complexity which is better than both the complexity of GD and SAGA. We further generalize the results to a probabilistic interpolation of SAGA and minibatch SAGA, which allows us to compute both the optimal probability and the optimal minibatch size. While the theoretical improvement may not be large, the practical improvement is robustly present across all synthetic and real data we tested for, and can be substantial. Our theoretical results suggest that for this optimal minibatch size our method achieves linear speedup in minibatch size, which is of key practical importance as minibatch implementations are used to train machine learning models in practice. This is the first time linear speedup in minibatch size is obtained for a variance reduced gradient-type method by directly solving the primal empirical risk minimization problem.
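The probabilistic interpolation can be sketched directly (a toy reconstruction, not the authors' code; the probability `q=0.1` is an arbitrary illustrative choice, not the paper's optimal $q$): with probability $q$ take a full-gradient step and refresh the gradient table, otherwise take a standard SAGA step:

```python
import random

# SAGD-style sketch on f(w) = (1/n) sum_i (w - a_i)^2: with probability
# q do a full-gradient (GD) step and refresh the whole gradient table,
# otherwise do a SAGA step that updates one stored gradient.
def sagd(a, q=0.1, eta=0.05, steps=3000, seed=0):
    rng = random.Random(seed)
    n = len(a)
    w = 0.0
    mem = [2.0 * (w - ai) for ai in a]        # stored per-example gradients
    for _ in range(steps):
        if rng.random() < q:                  # GD step with probability q
            g = sum(2.0 * (w - ai) for ai in a) / n
            mem = [2.0 * (w - ai) for ai in a]
        else:                                 # SAGA step otherwise
            i = rng.randrange(n)
            gi = 2.0 * (w - a[i])
            g = gi - mem[i] + sum(mem) / n    # variance-reduced estimate
            mem[i] = gi
        w -= eta * g
    return w

w = sagd([1.0, 2.0, 3.0, 4.0])                # minimizer is the mean, 2.5
```

The expected per-iteration cost is $qn + (1-q)$ gradient evaluations, which is the quantity the paper trades off against the iteration count to find the optimal $q$.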

SupportNet: solving catastrophic forgetting in class incremental learning with support data

Li, Yu; Li, Zhongxiao; Ding, Lizhong; Yang, Peng; Hu, Yuhui; Chen, Wei; Gao, Xin (arXiv, 2018-06-08) [Preprint]

A plain well-trained deep learning model often does not have the ability to learn new knowledge without forgetting the previously learned knowledge, a phenomenon known as catastrophic forgetting. Here we propose a novel method, SupportNet, to solve the catastrophic forgetting problem in the class incremental learning scenario efficiently and effectively. SupportNet combines the strengths of deep learning and the support vector machine (SVM), where the SVM is used to identify the support data from the old data, which are fed to the deep learning model together with the new data for further training so that the model can review the essential information of the old data when learning the new information. Two powerful consolidation regularizers are applied to ensure the robustness of the learned model. Comprehensive experiments on various tasks, including enzyme function prediction, subcellular structure classification and breast tumor classification, show that SupportNet drastically outperforms the state-of-the-art incremental learning methods and even reaches similar performance to a deep learning model trained from scratch on both old and new data. Our program is accessible at: https://github.com/lykaust15/SupportNet
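The support-data selection step can be sketched with a toy linear SVM (my illustration with made-up 1-D data; SupportNet itself applies the SVM to features learned by a deep network): fit the SVM to the old classes, keep the examples lying on or inside the margin as "support data", and mix them with new-class data for continued training:

```python
# Toy sketch of support-data selection: fit a linear SVM (subgradient
# descent on the regularized hinge loss, no bias term) to old 1-D data,
# then keep the examples on or inside the margin (y * w * x <= ~1) as
# the "support data" to replay alongside new-class data.
old_data = [(-3.0, -1), (-2.0, -1), (-0.5, -1), (0.5, 1), (2.0, 1), (3.0, 1)]

def train_svm(data, lam=0.05, iters=500):
    w = 0.0
    for t in range(iters):
        # full subgradient of lam/2 * w^2 + (1/n) sum hinge(1 - y*w*x)
        g = lam * w
        for x, y in data:
            if y * w * x < 1.0:
                g -= y * x / len(data)
        w -= 0.5 / (1.0 + 0.1 * t) * g       # diminishing step size
    return w

w = train_svm(old_data)
support = [(x, y) for x, y in old_data if y * w * x <= 1.2]  # margin set
# `support` would be combined with new-class data for further training.
```

On this toy set the two points nearest the decision boundary ($\pm 0.5$) are selected, while the easy points far from the margin are dropped, which is precisely the data-reduction effect the method relies on.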
