
Knowledge-Based Systems

Volume 252, 27 September 2022, 109312

Convergence analysis of asynchronous stochastic recursive gradient algorithms

https://doi.org/10.1016/j.knosys.2022.109312

Abstract

Asynchronous stochastic algorithms with variance reduction techniques have been empirically shown to be useful for many large-scale machine learning problems. By making a parallel optimization algorithm asynchronous, one can reduce the synchronization cost and improve the practical efficiency. Recently, the stochastic recursive gradient algorithm has shown superior theoretical performance; however, it is not scalable enough in the current big data era. To make it more practical, we propose a class of asynchronous stochastic recursive gradient methods and analyze them in the shared memory model. The analysis shows that our asynchronous algorithms converge linearly to the solution in the strongly convex case and complete each iteration faster. In addition, we analyze the “price of asynchrony” and give the sufficient conditions required for linear speedup. To the best of our knowledge, our speedup conditions match the optimal bounds of asynchronous stochastic algorithms known thus far. Finally, we validate our theoretical analyses with a practical implementation on a multicore machine.

Introduction

In this paper, we consider the following optimization problem:
$$\min_{w \in \mathbb{R}^m} f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w), \qquad (1)$$
where each $f_i(w): \mathbb{R}^m \to \mathbb{R}$, $i \in [n] = \{1, \ldots, n\}$, is convex and has a Lipschitz continuous gradient. This type of problem arises frequently in supervised learning. For example, given a set of training examples $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$, the least squares regression model can be written as (1) with $f_i(w) = (\langle x_i, w\rangle - y_i)^2$. The logistic regression model for binary classification can be written as (1) with $f_i(w) = \log(1 + \exp(-y_i \langle x_i, w\rangle))$, $y_i \in \{-1, +1\}$.
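To make the finite-sum structure concrete, the following minimal sketch (Python/NumPy; the helper names are illustrative and not from the paper) evaluates the per-sample logistic loss $f_i$, its gradient, and the full objective $f$ of (1).

```python
import numpy as np

def logistic_fi(w, x_i, y_i):
    """Per-sample logistic loss f_i(w) = log(1 + exp(-y_i * <x_i, w>))."""
    return np.log1p(np.exp(-y_i * x_i.dot(w)))

def logistic_grad_fi(w, x_i, y_i):
    """Gradient of the per-sample logistic loss with respect to w."""
    z = -y_i * x_i.dot(w)
    sigma = 1.0 / (1.0 + np.exp(-z))      # sigmoid(z) = exp(z) / (1 + exp(z))
    return -y_i * sigma * x_i

def finite_sum_objective(w, X, y):
    """f(w) = (1/n) * sum_i f_i(w) over the whole training set."""
    return np.mean([logistic_fi(w, X[i], y[i]) for i in range(len(y))])
```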

Many advanced optimization algorithms have been developed for problem (1). The stochastic gradient descent (SGD) method originating from [1] is perhaps the most common approach. In each iteration, SGD randomly picks an index $i_t \in [n]$ and updates the iterate as $w_{t+1} = w_t - \eta_t \nabla f_{i_t}(w_t)$ with an appropriate step size $\eta_t$. In recent years, a set of semistochastic methods have been proposed to reduce the variance with the help of the specific finite-sum form of (1), such as SAG [2], SAGA [3], SVRG [4], SDCA [5], and S2GD [6]. It has been demonstrated that these variance-reduced methods achieve a linear convergence rate faster than SGD in the strongly convex case. In late 2017, the stochastic recursive gradient algorithm (named SARAH or SPIDER) was proposed in [7], [8], which exhibits better theoretical properties than previous variance-reduced methods. For example, the stochastic recursive gradient method takes greater advantage of up-to-date information, and the stochastic gradient estimator in the inner loop of SARAH/SPIDER linearly diminishes to zero.
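For reference, here is a minimal serial sketch of a SARAH-style recursive gradient epoch as described above (Python/NumPy; the `grad_fi`/`full_grad` oracles and uniform sampling are assumptions for illustration, not the implementation of [7], [8]).

```python
import numpy as np

def sarah_epoch(w0, grad_fi, full_grad, n, eta, num_inner):
    """One epoch of a SARAH-style recursive gradient method (serial sketch).

    grad_fi(w, i) -- gradient of the i-th component function at w
    full_grad(w)  -- full gradient (1/n) * sum_i grad_fi(w, i)
    """
    w_prev, v = w0, full_grad(w0)        # v_0 = grad f(w_0)
    w = w_prev - eta * v                 # w_1 = w_0 - eta * v_0
    for _ in range(num_inner):
        i = np.random.randint(n)         # sample i_t uniformly from [n]
        # recursive estimator: v_t = grad f_i(w_t) - grad f_i(w_{t-1}) + v_{t-1}
        v = grad_fi(w, i) - grad_fi(w_prev, i) + v
        w_prev, w = w, w - eta * v       # w_{t+1} = w_t - eta * v_t
    return w
```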

Randomized coordinate descent (RCD) is a second line of research for solving problem (1). In each iteration, a randomly selected block of coordinates is updated as $w_{t+1} = w_t - \eta_t \nabla_{j_t} f(w_t)\, e_{j_t}$, where $j_t$ is randomly chosen from $[m] = \{1, 2, \ldots, m\}$, and $e_{j_t}$ is the natural basis vector with “1” in the $j_t$-th position and “0” otherwise. Since the computation of the coordinate directional derivatives can be much faster for many coordinate-friendly problems [9], RCD and its many variants achieve much better computational performance in this case. Meanwhile, there has been a surge of research interest in merging SGD and RCD, such as MRBCD [10], ASBCD [11], ADSG [12], and AsyDSPG+ [13]. Essentially, these methods randomly sample a block of coordinates based on randomly sampled training data and utilize variance reduction techniques, which avoids full dataset access and full-dimension operations in each iteration.
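The following sketch illustrates a single RCD step of the form above (Python/NumPy; `partial_grad` is a hypothetical coordinate-gradient oracle).

```python
import numpy as np

def rcd_step(w, partial_grad, m, eta):
    """One randomized coordinate descent step (illustrative sketch).

    partial_grad(w, j) -- the j-th coordinate of grad f(w)
    """
    j = np.random.randint(m)               # pick j_t uniformly from [m]
    w_next = w.copy()
    w_next[j] -= eta * partial_grad(w, j)  # update only the j_t-th coordinate
    return w_next
```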

As mentioned previously, many serial algorithms have recently been developed and analyzed for problem (1), although they often suffer from scalability issues in the current big data era. It is well known that asynchronous parallelism has become a popular way to speed up machine learning algorithms using multiple processors. Considering that the stochastic recursive gradient method has shown superior performance both in theory and in practice, this paper focuses on asynchronous stochastic recursive gradient algorithms. However, how to guarantee a successful asynchronous implementation of this method remains an open question, even though many asynchronous stochastic variance-reduced algorithms have been proposed and analyzed in recent years. The difficulty lies in jointly analyzing the recursive gradient iteration and the asynchronous updates.

Contributions. To address this issue, we propose the asynchronous stochastic recursive gradient (ASRG) method and the asynchronous stochastic coordinate recursive gradient (ASCRG) method; the latter is the first to incorporate RCD into the stochastic recursive gradient method. We conduct a theoretical analysis of our asynchronous algorithms. The results show that both proposed methods converge linearly to the solution of problem (1) while allowing constant step sizes and faster iterations. Specifically, ASRG achieves a linear speedup when the problem is sparse and the time delay satisfies $\tau \leq O(1/(\Delta \kappa))$; ASCRG obtains a linear speedup whether the problem is sparse or not, as long as the time delay satisfies $\tau \leq O(m^{1/2})$. When the sufficient conditions for linear speedup are satisfied, “the price of asynchrony” (i.e., the error induced by reading out-of-date information) becomes asymptotically negligible. To the best of our knowledge, our speedup conditions match the optimal bounds of asynchronous stochastic algorithms known thus far.

We use bold lowercase letters for vectors, e.g., $\mathbf{w}$, and lowercase letters for scalars, e.g., $m$. We define $e_j \in \mathbb{R}^m$ as the natural basis vector $(0, \ldots, 1, \ldots, 0)^T$ with “1” in the $j$th position and “0” otherwise. $\|\cdot\|$ is the Euclidean norm $\|\cdot\|_2$. We use $(v)_j$ for the $j$th element of vector $v$, $(v)_{-j}$ for the rest except the $j$th element, and $\nabla_j f_i(w)$ for the $j$th element of the gradient of the function $f_i(w)$ at the point $w$. $w^*$ denotes the minimum point of the objective function $f(w)$. We call $\Delta$ the problem sparsity, which is the largest frequency of a feature appearing in the data matrix. For two nonnegative sequences $\{a_n\}$ and $\{b_n\}$, we say $a_n = O(b_n)$ if there exists a constant $0 < C < +\infty$ such that $a_n \leq C b_n$.
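As an illustration of the problem sparsity $\Delta$ defined above, the following sketch counts how often the most frequently appearing feature is nonzero in the data matrix; the normalization by $n$ is an assumption about the intended definition, not a statement of the paper's exact formula.

```python
import numpy as np

def problem_sparsity(X):
    """Delta: fraction of samples in which the most frequent feature is nonzero.

    X is an (n, m) data matrix; dividing by n (so Delta lies in (0, 1])
    is an assumption for illustration.
    """
    nonzero_counts = np.count_nonzero(X, axis=0)   # per-feature appearance counts
    return nonzero_counts.max() / X.shape[0]
```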

Section snippets

Related work

In this section, we briefly review some asynchronous stochastic variance-reduced algorithms that are closely related to ours. Ref. [15] first proposes asynchronous versions of SVRG and SAGA. They present the speedup condition, i.e., $\tau \leq O((1/\Delta)^{1/2})$, but their results rely on the consistent read assumption. Ref. [16] introduces two asynchronous algorithms, Async-ProxSVRG and Async-ProxSVRCD: Async-ProxSVRG can obtain a linear speedup with a condition of $\tau \leq O((1/\Delta)^{1/4})$; Async-ProxSVRCD does not require

Preliminaries

Consider an asynchronous shared memory parallel model such as multicore processors and GPU-accelerators, on which all p processors have read and write access to a shared memory in an asynchronous and lock-free fashion. For asynchronous algorithms, multiple processors continuously run the following gradient descent steps:

R: Read the global variable w from the shared memory and evaluate a stochastic gradient estimator based on w;

W: Update w along the direction of the stochastic gradient estimator, as sketched below.
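A minimal lock-free sketch of this read/update loop (Python threads acting on a shared NumPy array; the gradient oracle, step size, and sampling are assumptions, and Python's GIL means this only illustrates the access pattern rather than real parallel speedup):

```python
import threading
import numpy as np

w_shared = np.zeros(100)   # global variable w kept in shared memory

def worker(grad_fi, n, eta, num_iters):
    for _ in range(num_iters):
        w_local = w_shared.copy()        # R: lock-free (possibly inconsistent) read of w
        i = np.random.randint(n)
        g = grad_fi(w_local, i)          # stochastic gradient evaluated at the (stale) read
        w_shared[:] -= eta * g           # W: lock-free in-place update of the shared iterate

def run_async(grad_fi, n, eta=0.01, num_iters=1000, num_threads=4):
    threads = [threading.Thread(target=worker, args=(grad_fi, n, eta, num_iters))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w_shared
```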

ASRG algorithm

The ASRG algorithm is presented in Algorithm 1, which has two-layer loops. Taking the $s$th epoch as an example, the outer layer computes the full gradient $v_0^s = \nabla f(w_0^s)$ in parallel and then proceeds by updating $w_1^s = w_0^s - \eta v_0^s$. In the inner layer, each processor randomly chooses an index $i_t^s \in [n]$ and executes the update as follows: $v_t^s = \nabla f_{i_t^s}(\hat{w}_t^s) - \nabla f_{i_t^s}(\hat{w}_{t-1}^s) + v_{t-1}^s$, $w_{t+1}^s = w_t^s - \eta v_t^s$, where $\hat{w}_t^s$ and $w_t^s$ are the global variables to be read and updated at the $t$th iteration, respectively. The superscript $s$
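The following sketch shows the inner recursion above for a single worker, with the asynchrony abstracted into `read_shared`/`write_shared` callbacks (these callbacks and the sampling scheme are assumptions for illustration; this is not Algorithm 1 itself).

```python
import numpy as np

def asrg_inner_sketch(v0, w_hat_prev, grad_fi, read_shared, write_shared,
                      n, eta, num_inner):
    """ASRG-style inner loop for one worker (illustrative sketch).

    read_shared()   -- lock-free read of the shared iterate (the stale "w-hat")
    write_shared(d) -- subtract d from the shared iterate
    """
    v = v0                                # v_0^s = grad f(w_0^s) from the outer layer
    for _ in range(num_inner):
        i = np.random.randint(n)          # sample i_t^s uniformly from [n]
        w_hat = read_shared()             # possibly inconsistent read of w
        # v_t^s = grad f_i(w-hat_t^s) - grad f_i(w-hat_{t-1}^s) + v_{t-1}^s
        v = grad_fi(w_hat, i) - grad_fi(w_hat_prev, i) + v
        write_shared(eta * v)             # w_{t+1}^s = w_t^s - eta * v_t^s
        w_hat_prev = w_hat
    return v
```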

Convergence analysis

In this section, we first give several basic assumptions and then provide the convergence analysis of our ASRG and ASCRG algorithms. We defer all proofs in this section to the Appendix.

Comparison & discussion

We present the sufficient conditions for the linear speedup of several benchmark asynchronous algorithms in Table 1. Comparing these conditions, we find that the speedup condition mainly depends on two crucial parameters, i.e., the problem sparsity $\Delta$ and/or the problem dimension $m$. It follows that ASRG relies on the problem sparsity, while ASCRG does not rely on the problem sparsity but requires that the dimension $m$ be large enough. The speedup conditions of our methods approximately

Numerical experiments

In this section, we conduct experiments to validate the effectiveness of our asynchronous algorithms. In the experiments, we focus on the regularized logistic regression (LR) problem: $\min_{w \in \mathbb{R}^m} f(w) = \frac{1}{n}\sum_{i=1}^{n} \log\left(1 + \exp(-y_i \langle x_i, w\rangle)\right) + \frac{\lambda}{2}\|w\|^2$, and the regularized least squares (LS) problem: $\min_{w \in \mathbb{R}^m} f(w) = \frac{1}{2n}\sum_{i=1}^{n} \left(\langle x_i, w\rangle - y_i\right)^2 + \frac{\lambda}{2}\|w\|^2$, given a set of training data $\{x_i, y_i\}_{i=1}^{n}$, $x_i \in \mathbb{R}^m$, and $y_i \in \{-1, +1\}$. We set $\lambda = 0.0001$ in all experiments. We use three large datasets from LIBSVM [24]: news20.binary (n = 19,996; m =
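For concreteness, here is a minimal sketch of the two regularized objectives above (Python/NumPy; reading the LIBSVM datasets via `sklearn.datasets.load_svmlight_file` and the commented file path are assumptions, not the paper's code).

```python
import numpy as np
from sklearn.datasets import load_svmlight_file   # reads LIBSVM-format files

LAMBDA = 1e-4   # regularization weight lambda used in the experiments

def regularized_lr_objective(w, X, y, lam=LAMBDA):
    """Regularized logistic regression (LR) objective."""
    margins = -y * (X @ w)
    return np.mean(np.log1p(np.exp(margins))) + 0.5 * lam * np.dot(w, w)

def regularized_ls_objective(w, X, y, lam=LAMBDA):
    """Regularized least squares (LS) objective."""
    residuals = X @ w - y
    return 0.5 * np.mean(residuals ** 2) + 0.5 * lam * np.dot(w, w)

# Example usage (hypothetical local path to the news20.binary file):
# X, y = load_svmlight_file("news20.binary")
# f0 = regularized_lr_objective(np.zeros(X.shape[1]), X, y)
```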

Conclusions

In this paper, we propose two asynchronous algorithms, ASRG and ASCRG. We prove that our algorithms linearly converge to the minimum point of convex problem (1). We present the sufficient conditions that are required for linear speedup. Furthermore, we give intuitive explanations that reveal why the speedup condition depends on the dimension and/or problem sparsity.

CRediT authorship contribution statement

Pengfei Wang: Conceptualization, Methodology, Software, Writing – original draft. Nenggan Zheng: Supervision, Project administration, Resources, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Key R&D Program of China (2020YFB1313501), the Key R&D Program of Zhejiang Province, China (2021C03003, 2022C01022, 2022C01119), the Zhejiang Provincial Natural Science Foundation, China (LR19F020005), the National Natural Science Foundation of China (61972347), and the Fundamental Research Funds for the Central Universities, China (No. 226-2022-00051).

References (25)

  • H. Robbins, et al., A stochastic approximation method, Ann. Math. Stat. (1951)
  • M. Schmidt, et al., Minimizing finite sums with the stochastic average gradient, Math. Program. (2017)
  • A. Defazio, et al., SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives
  • R. Johnson, et al., Accelerating stochastic gradient descent using predictive variance reduction
  • S. Shalev-Shwartz, et al., Stochastic dual coordinate ascent methods for regularized loss minimization, J. Mach. Learn. Res. (2012)
  • J. Konečný, et al., Mini-batch semi-stochastic gradient descent in the proximal setting, IEEE J. Sel. Top. Signal Process. (2015)
  • L.M. Nguyen, et al., SARAH: A novel method for machine learning problems using stochastic recursive gradient
  • C. Fang, et al., SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator
  • Z. Peng, et al., ARock: An algorithmic framework for asynchronous parallel coordinate updates, SIAM J. Sci. Comput. (2016)
  • T. Zhao, et al., Accelerated mini-batch randomized block coordinate descent method
  • A. Zhang, Q. Gu, Accelerated stochastic block coordinate descent with optimal sampling, in: Proceedings of the 22nd ACM...
  • Z. Shen, et al., Accelerated doubly stochastic gradient algorithm for large-scale empirical risk minimization
