# Mathematical Sciences Research Institute

Home » Workshop » Schedules » From an ODE to accelerated stochastic gradient descent: convergence rate and empirical results

# From an ODE to accelerated stochastic gradient descent: convergence rate and empirical results

## [Moved Online] Hot Topics: Optimal transport and applications to machine learning and statistics May 04, 2020 - May 08, 2020

May 07, 2020 (11:00 AM PDT - 12:00 PM PDT)
Location: MSRI: Online/Virtual
Tags/Keywords

• convex optimization

• accelerated algorithms

• Nesterov's method

Primary Mathematics Subject Classification
Secondary Mathematics Subject Classification
Video

#### From An ODE To Accelerated Stochastic Gradient Descent: Convergence Rate And Empirical Results

Abstract

Su, Boyd and Candés showed that Nesterov's method for accelerated gradient descent can be obtained as the discretization of an Ordinary Differential Equation (ODE). By discretizing a modified ODE we show that : (i) we recover Nesterov's method for convex functions using a constant learning rate, (ii) choosing a decreasing learning rate leads to an accelerated stochastic gradient descent algorithm. The learning rate for the stochastic algorithm is of order k^(3/4), which is novel. We also study a second ODE and obtain a corresponding algorithm for the strongly convex case.

Convergence rates for the algorithms are established using a Lyapunov function approach which unifies the continuous time, deterministic gradient, and stochastic gradient cases. The stochastic algorithms are accelerated (in the convex case, and stochastic approximation setting), in the sense that the rate constant improves over existing results. Existing results have a rate constant which depends on G^2, the bound on the norm of the stochastic gradient. The accelerated algorithm rate constant depends on the smaller constant depends on \sigma^2, the variance of the stochastic gradient.

Numerical results in both the stochastic approximation and the finite sum setting, demonstrate that the accelerated algorithms can converge faster than SGD, or momentum SGD algorithms. Moreover, the algorithms are easily tuned, in the sense that empirical results are robust to changes in the learning rate schedule and constants. This is joint work with Maxime Laborde.

Supplements