PRIAG: Proximal Reweighted Incremental Aggregated Gradient Algorithm for Distributed Optimizations

Deng, Xiaoge; Sun, Tao; Liu, Feng; Huang, Feng

doi:10.1007/978-3-030-60245-1_34

Xiaoge Deng ORCID: orcid.org/0000-0003-0622-1202

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12452)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing


Abstract

Large-scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e., algorithms that leverage multiple workers for training. However, collecting the information from all workers in every iteration is sometimes expensive or even prohibitive. In this paper, we propose an iterative algorithm called proximal reweighted incremental aggregated gradient (PRIAG) for solving a class of nonconvex and nonsmooth problems, which are ubiquitous in machine learning tasks and distributed optimization problems. In each iteration, the algorithm needs information from only one worker, thanks to the incremental aggregated method. Combined with the reweighting technique, it requires only an easy-to-compute proximal operator to handle the nonconvex and nonsmooth terms. Using Lyapunov function analysis, we prove that the PRIAG algorithm converges under some mild assumptions. We apply this approach to nonconvex nonsmooth problems and distributed optimization tasks. Numerical experiments on both synthetic and real data sets show that our algorithm achieves learning performance comparable to that of previous nonconvex solvers, but more efficiently.
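The iteration described above (one refreshed worker gradient per step, followed by a proximal step on a reweighted convex surrogate) can be sketched as follows. This is a minimal illustration under assumptions that do not come from the paper: least-squares losses $f_i(x)=\frac{1}{2}\Vert A_i x-b_i\Vert _2^2$, $g(t)=|t|$, the concave penalty $h(t)=\lambda \log (1+t/\varepsilon )$, a cyclic worker order (so the delay bound is $\tau =m-1$), and the hypothetical names `priag_sketch`, `objective`, and `soft_threshold`. With $g=|\cdot |$ the weighted proximal operator reduces to soft-thresholding.

```python
import numpy as np


def soft_threshold(z, thresh):
    """Elementwise prox of the weighted l1 norm: sign(z) * max(|z| - thresh, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)


def objective(x, A_blocks, b_blocks, lam=0.1, eps=0.1):
    """Phi(x) = sum_i 0.5 * ||A_i x - b_i||^2 + lam * sum_j log(1 + |x_j| / eps)."""
    smooth = sum(0.5 * np.sum((A @ x - b) ** 2) for A, b in zip(A_blocks, b_blocks))
    return smooth + lam * np.sum(np.log1p(np.abs(x) / eps))


def priag_sketch(A_blocks, b_blocks, lam=0.1, eps=0.1, n_iter=2000):
    """Illustrative reweighted proximal step driven by aggregated, possibly stale gradients."""
    m = len(A_blocks)                      # number of workers
    n = A_blocks[0].shape[1]               # problem dimension
    # L = sum_i L_i, with L_i the squared spectral norm of A_i (assumed loss).
    L = sum(np.linalg.norm(A, 2) ** 2 for A in A_blocks)
    tau = m - 1                            # cyclic refresh: stored gradients are at most m-1 steps old
    gamma = 0.9 * 2.0 / ((2 * tau + 1) * L)  # inside the step-size range (0, 2 / ((2*tau+1) * L))

    x = np.zeros(n)
    # Gradient table: one stored gradient per worker, all initialised at x^0.
    grads = [A.T @ (A @ x - b) for A, b in zip(A_blocks, b_blocks)]
    v = np.sum(grads, axis=0)              # aggregated gradient v^k

    for k in range(n_iter):
        w = lam / (eps + np.abs(x))        # weights w_j^k = h'(g(x_j^k)) for the assumed h and g
        x = soft_threshold(x - gamma * v, gamma * w)
        i = k % m                          # only worker i reports a fresh gradient
        new_grad = A_blocks[i].T @ (A_blocks[i] @ x - b_blocks[i])
        v += new_grad - grads[i]           # update the aggregate without touching other workers
        grads[i] = new_grad
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A_blocks = [rng.standard_normal((10, 20)) for _ in range(4)]
    x_true = np.zeros(20)
    x_true[[2, 7, 11]] = [1.0, -2.0, 1.5]
    b_blocks = [A @ x_true for A in A_blocks]
    x_hat = priag_sketch(A_blocks, b_blocks)
    print(objective(x_hat, A_blocks, b_blocks))
```

Keeping a per-worker gradient table lets the aggregate be updated with a single fresh gradient per iteration, which is the communication saving the abstract refers to; the price is the more conservative step size that shrinks with the delay bound.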

References

Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Math. Program. 137(1–2), 91–129 (2013)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Bertsekas, D.P.: Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. Optim. Mach. Learn. 2010(1–38), 3 (2011)
Blatt, D., Hero, A.O., Gauchman, H.: A convergent incremental gradient method with a constant step size. SIAM J. Optim. 18(1), 29–51 (2007)
Chartrand, R., Yin, W.: Iteratively reweighted algorithms for compressive sensing. In: 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3869–3872. IEEE (2008)
Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM Rev. 43(1), 129–159 (2001)
Chen, X., Ng, M.K., Zhang, C.: Non-Lipschitz $\ell _ p$-regularization and box constrained model for image restoration. IEEE Trans. Image Process. 21(12), 4709–4721 (2012)
Combettes, P.L., Wajs, V.R.: Signal recovery by proximal forward-backward splitting. Multiscale Model. Simul. 4(4), 1168–1200 (2005)
Davis, D., Yin, W.: Convergence rate analysis of several splitting schemes. In: Glowinski, R., Osher, S.J., Yin, W. (eds.) Splitting Methods in Communication, Imaging, Science, and Engineering. SC, pp. 115–163. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-41589-5_4
Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, pp. 1646–1654 (2014)
Donoho, D.L.: Compressed sensing. IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006)
Duchi, J.C., Agarwal, A., Wainwright, M.J.: Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Trans. Autom. Control 57(3), 592–606 (2011)
Figueiredo, M.A., Nowak, R.D., Wright, S.J.: Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems. IEEE J. Sel. Top. Signal Process. 1(4), 586–597 (2007)
Gasso, G., Rakotomamonjy, A., Canu, S.: Recovering sparse signals with a certain family of nonconvex penalties and DC programming. IEEE Trans. Signal Process. 57(12), 4686–4698 (2009)
Giannakis, G.B., Kekatos, V., Gatsis, N., Kim, S.J., Zhu, H., Wollenberg, B.F.: Monitoring and optimization for power grids: a signal processing perspective. IEEE Signal Process. Mag. 30(5), 107–128 (2013)
Gong, P., Ye, J., Zhang, C.: Robust multi-task feature learning. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 895–903 (2012)
Guo, F., Wen, C., Mao, J., Song, Y.D.: Distributed economic dispatch for smart grids with random wind power. IEEE Trans. Smart Grid 7(3), 1572–1583 (2015)
Jacob, L., Obozinski, G., Vert, J.P.: Group lasso with overlap and graph lasso. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 433–440 (2009)
Lai, M.J., Xu, Y., Yin, W.: Improved iteratively reweighted least squares for unconstrained smoothed $\ell _q$ minimization. SIAM J. Numer. Anal. 51(2), 927–957 (2013)
Lu, C., Wei, Y., Lin, Z., Yan, S.: Proximal iteratively reweighted algorithm with multiple splitting for nonconvex sparsity optimization. In: Twenty-Eighth AAAI Conference on Artificial Intelligence (2014)
Lu, C., Zhu, C., Xu, C., Yan, S., Lin, Z.: Generalized singular value thresholding. In: Twenty-Ninth AAAI Conference on Artificial Intelligence (2015)
Lu, Z., Zhang, Y.: Schatten-p quasi-norm regularized matrix optimization via iterative reweighted singular value minimization. arXiv preprint arXiv:1401.0869 (2015)
Mateos, G., Bazerque, J.A., Giannakis, G.B.: Distributed sparse linear regression. IEEE Trans. Signal Process. 58(10), 5262–5276 (2010)
Mordukhovich, B.S.: Variational Analysis and Generalized Differentiation I: Basic Theory, vol. 330. Springer, Heidelberg (2006)
Padakandla, A., Sundaresan, R.: Separable convex optimization problems with linear ascending constraints. SIAM J. Optim. 20(3), 1185–1204 (2010)
Parikh, N., Boyd, S., et al.: Proximal algorithms. Found. Trends® Optim. 1(3), 127–239 (2014)
Rabbat, M., Nowak, R.: Distributed optimization in sensor networks. In: Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks, pp. 20–27 (2004)
Rabbat, M.G., Nowak, R.D.: Decentralized source localization and tracking [wireless sensor networks]. In: 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing. vol. 3, pp. iii–921. IEEE (2004)
Recht, B., Re, C., Wright, S., Niu, F.: Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 693–701 (2011)
Rockafellar, R.T., Wets, R.J.B.: Variational Analysis, vol. 317. Springer, Heidelberg (2009)
Shi, W., Ling, Q., Wu, G., Yin, W.: A proximal gradient algorithm for decentralized composite optimization. IEEE Trans. Signal Process. 63(22), 6013–6023 (2015)
Sun, T., Jiang, H., Cheng, L.: Convergence of proximal iteratively reweighted nuclear norm algorithm for image processing. IEEE Trans. Image Process. 26(12), 5632–5644 (2017)
Sun, T., Jiang, H., Cheng, L.: Global convergence of proximal iteratively reweighted algorithm. J. Global Optim. 68(4), 815–826 (2017). https://doi.org/10.1007/s10898-017-0507-z
Sun, T., Jiang, H., Cheng, L., Zhu, W.: Iteratively linearized reweighted alternating direction method of multipliers for a class of nonconvex problems. IEEE Trans. Signal Process. 66(20), 5380–5391 (2018)
Sun, T., Li, D., Jiang, H., Quan, Z.: Iteratively reweighted penalty alternating minimization methods with continuation for image deblurring. In: ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3757–3761. IEEE (2019)
Sun, T., Sun, Y., Li, D., Liao, Q.: General proximal incremental aggregated gradient algorithms: Better and novel results under general scheme. In: Advances in Neural Information Processing Systems, pp. 994–1004 (2019)
Vanli, N.D., Gurbuzbalaban, M., Ozdaglar, A.: Global convergence rate of proximal incremental aggregated gradient methods. SIAM J. Optim. 28(2), 1282–1300 (2018)
Weston, J., Elisseeff, A., Schölkopf, B., Tipping, M.: Use of the zero-norm with linear models and kernel methods. J. Mach. Learn. Res. 3(Mar), 1439–1461 (2003)
Wright, J., Yang, A.Y., Ganesh, A., Sastry, S.S., Ma, Y.: Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell. 31(2), 210–227 (2008)


Acknowledgement

This research was funded in part by the Core Electronic Devices, High-end Generic Chips, and Basic Software Major Special Projects (No. 2018ZX01028101), and in part by the National Natural Science Foundation of China (No. 61907034, No. 61932001, and No. 61906200).

Author information

Authors and Affiliations

National Laboratory for Parallel and Distributed Processing (PDL), College of Computer, National University of Defense Technology, Changsha, 410073, China
Xiaoge Deng, Tao Sun, Feng Liu & Feng Huang

Corresponding author

Correspondence to Tao Sun.

Editor information

Editors and Affiliations

Columbia University, New York, NY, USA
Meikang Qiu

Appendix

Proof of Lemma 1

Since $x^{k+1}$ is obtained by (7), combining Definitions 1 and 2 gives

$$\begin{aligned} \frac{x^{k}-x^{k+1}}{\gamma }-v^{k} \in \partial w^k \cdot \tilde{g} \left( x^{k+1}\right) . \end{aligned}$$

(26)

By the convexity of $g$, we have

$$\begin{aligned} w^k \cdot (\tilde{g} \left( x^{k+1}\right) -\tilde{g} \left( x^{k}\right) )\le \left\langle \frac{x^{k}-x^{k+1}}{\gamma }-v^{k}, x^{k+1}-x^k \right\rangle . \end{aligned}$$

(27)

Assumption (b) implies that the sum function $F$ is differentiable with an $L$-Lipschitz continuous gradient, i.e.,

$$\begin{aligned} \Vert \nabla F(w)-\nabla F(\overline{w})\Vert _{2} \le L \Vert w-\overline{w}\Vert _{2}, \end{aligned}$$

(28)

where $L=\sum _{i=1}^{m}L_i$. It follows that

$$\begin{aligned} F \left( x^{k+1}\right) -F \left( x^{k}\right) \le \left\langle \nabla F \left( x^{k}\right) , x^{k+1}-x^k \right\rangle + \frac{L}{2} \left\| x^{k+1}-x^k\right\| _{2}^{2}. \end{aligned}$$

(29)

Then we have

$$\begin{aligned} \begin{aligned}&\varPhi (x^{k+1})-\varPhi (x^{k})=F \left( x^{k+1}\right) -F \left( x^{k}\right) +\sum _{j=1}^{n} \left[ h \left( g\left( x_{j}^{k+1}\right) \right) -h \left( g\left( x_{j}^{k}\right) \right) \right] \\&\le \left\langle \nabla F\left( x^{k}\right) , x^{k+1}-x^k \right\rangle +\frac{L}{2}\left\| x^{k+1}-x^k\right\| _{2}^{2}+\sum _{j=1}^{n} \left[ h \left( g\left( x_{j}^{k+1}\right) \right) -h\left( g\left( x_{j}^{k}\right) \right) \right] \\&\le \left\langle \nabla F \left( x^{k}\right) , x^{k+1}-x^k \right\rangle +\frac{L}{2}\left\| x^{k+1}-x^k\right\| _{2}^{2}+\sum _{j=1}^{n} w_j^k \left( g\left( x_{j}^{k+1}\right) -g\left( x_{j}^{k} \right) \right) \\&= \left\langle \nabla F \left( x^{k}\right) , x^{k+1}-x^k \right\rangle +\frac{L}{2}\left\| x^{k+1}-x^k\right\| _{2}^{2}+w^k\cdot \left( \tilde{g} \left( x^{k+1}\right) -\tilde{g} \left( x^{k}\right) \right) \\&\le \left\langle \nabla F \left( x^{k}\right) , x^{k+1}-x^k \right\rangle + \frac{L}{2} \left\| x^{k+1}-x^k\right\| _{2}^{2}+\left\langle \frac{x^{k}-x^{k+1}}{\gamma }-v^{k}, x^{k+1}-x^k \right\rangle \\&=\underbrace{\left\langle \nabla F \left( x^{k}\right) -v^{k}, x^{k+1}-x^k \right\rangle }_{I} + \left( \frac{L}{2}-\frac{1}{\gamma } \right) \left\| x^{k+1}-x^k\right\| _{2}^{2}, \end{aligned} \end{aligned}$$

(30)

where $w_j^k:=h^{\prime }(g(x_j^k))$ and $\tilde{g}(x)=(g(x_1), g(x_2), \ldots , g(x_n))^{\top }$. The first inequality uses (29). The second inequality uses the concavity of $h$. The third inequality uses (27). We now bound $I$. First, since $\max _{i,k}\{\tau _{i,k}\}\le \tau $, we have

$$\begin{aligned} \left\| x^{k}-x^{k-\tau _{i, k}}\right\| _2 \le \sum _{d=k-\tau _{i, k}}^{k-1} \left\| x^{d+1}-x^d\right\| _2 \le \sum _{d=k-\tau }^{k-1} \left\| x^{d+1}-x^d\right\| _2. \end{aligned}$$

(31)

Combining with (7), we have

$$\begin{aligned} \begin{aligned} I&=\left\langle \nabla F \left( x^{k}\right) -v^{k}, x^{k+1}-x^k \right\rangle \\&=\left\langle \sum _{i=1}^{m}\left( \nabla f_i \left( x^{k} \right) -\nabla f_i \left( x^{k-\tau _{i,k}} \right) \right) , x^{k+1}-x^k \right\rangle \\&\le \sum _{i=1}^{m} L_{i} \left\| x^{k}-x^{k-\tau _{i, k}}\right\| _2 \cdot \left\| x^{k+1}-x^k\right\| _2\\&\le \sum _{i=1}^{m} L_{i} \left( \sum _{d=k-\tau }^{k-1} \left\| x^{d+1}-x^d \right\| _2 \right) \cdot \left\| x^{k+1}-x^k \right\| _2\\&= L \sum _{d=k-\tau }^{k-1} \left\| x^{d+1}-x^d \right\| _2 \cdot \left\| x^{k+1}-x^k \right\| _2. \end{aligned} \end{aligned}$$

(32)

The first inequality uses the Cauchy-Schwarz inequality and the Lipschitz continuity of $\nabla f_{i}$. The second inequality uses (31). Meanwhile, for any $\xi >0$, Cauchy’s inequality gives

$$\begin{aligned} \left\| x^{d+1}-x^d\right\| _2 \cdot \left\| x^{k+1}-x^k\right\| _2 \le \frac{1}{2 \xi } \left\| x^{d+1}-x^d\right\| _{2}^{2}+\frac{\xi }{2}\left\| x^{k+1}-x^k \right\| _{2}^{2}. \end{aligned}$$

(33)

Then we have

$$\begin{aligned} I \le \frac{L}{2 \xi } \sum _{d=k-\tau }^{k-1} \left\| x^{d+1}-x^d\right\| _{2}^{2}+\frac{\tau \xi L}{2} \left\| x^{k+1}-x^k\right\| _{2}^{2}. \end{aligned}$$

(34)

Combining (30) and (34), we have

$$\begin{aligned} \varPhi (x^{k+1})-\varPhi (x^{k}) \le \frac{L}{2 \xi } \sum _{d=k-\tau }^{k-1} \left\| x^{d+1}-x^d\right\| _{2}^{2}+\left[ \frac{(\tau \xi +1)L}{2}-\frac{1}{\gamma }\right] \left\| x^{k+1}-x^k\right\| _{2}^{2}. \end{aligned}$$

(35)

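If $0<\gamma <\frac{2}{(2\tau +1)L}$, then $\frac{1}{\tau }\left( \frac{1}{\gamma L}-\frac{1}{2}\right) >1$, so the right-hand side of the equation below exceeds 2 and we can choose $\xi >0$ such that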

$$\begin{aligned} \xi +\frac{1}{\xi }=1+\frac{1}{\tau }\left( \frac{1}{\gamma L}-\frac{1}{2}\right) . \end{aligned}$$

(36)

Then, with direct calculations and substitutions, we have

$$\begin{aligned} \begin{aligned}&\varGamma _{k}(\xi )-\varGamma _{k+1}(\xi ) \\&=\varPhi \left( x^{k}\right) -\varPhi \left( x^{k+1}\right) +\frac{L}{2 \xi }\sum _{d=k-\tau }^{k-1}(d-(k-\tau )+1)\left\| x^{d+1}-x^d \right\| _{2}^{2}\\&\ \ \ -\frac{L}{2\xi }\sum _{d=k+1-\tau }^{k}(d-(k-\tau ))\left\| x^{d+1}-x^d\right\| _{2}^{2}\\&= \varPhi \left( x^{k}\right) -\varPhi \left( x^{k+1}\right) +\frac{L}{2 \xi }\sum _{d=k-\tau }^{k-1}(d-(k-\tau )+1)\left\| x^{d+1}-x^d\right\| _{2}^{2}\\&\ \ \ -\frac{L}{2 \xi }\sum _{d=k-\tau }^{k-1}(d-(k-\tau ))\left\| x^{d+1}-x^d\right\| _{2}^{2}-\frac{L}{2 \xi } \tau \left\| x^{k+1}-x^k \right\| _{2}^{2}\\&= \varPhi \left( x^{k}\right) -\varPhi \left( x^{k+1}\right) +\frac{L}{2 \xi }\sum _{d=k-\tau }^{k-1} \left\| x^{d+1}-x^d\right\| _{2}^{2}-\frac{L}{2\xi } \tau \left\| x^{k+1}-x^k \right\| _{2}^{2}\\&\ge \left( \frac{1}{\gamma }-\frac{(\tau \xi +1)L}{2}-\frac{\tau L}{2\xi } \right) \left\| x^{k+1}-x^k\right\| _{2}^{2}\\&=\frac{1}{2} \left( \frac{1}{\gamma }-\frac{L}{2}-\tau L\right) \left\| x^{k+1}-x^k\right\| _{2}^{2}\\&\ge \frac{1}{4} \left( \frac{1}{\gamma }-\frac{L}{2}-\tau L\right) \left\| x^{k+1}-x^k\right\| _{2}^{2}. \end{aligned} \end{aligned}$$

(37)

The first inequality uses (35). The last equality uses (36), which gives $\frac{\tau L}{2}\left( \xi +\frac{1}{\xi }\right) =\frac{\tau L}{2}+\frac{1}{2\gamma }-\frac{L}{4}$. The final inequality holds because $\frac{1}{\gamma }-\frac{L}{2}-\tau L>0$ for the stated step size. This proves the first result. Summing inequality (37) over $k$ (and assuming, as is standard, that $\varPhi $ is bounded below), we have:

$$\begin{aligned} \sum _{k=1}^{\infty }\left\| x^{k+1}-x^k \right\| _{2}^{ 2 }<\infty . \end{aligned}$$

(38)

The second result follows immediately. Using [Lemma 3, [9]], we obtain

$$\begin{aligned} \min _{1\le i\le k}\left\| x^{i+1}-x^i \right\| _{2}^2 =o\left( \frac{1}{k}\right) , \end{aligned}$$

(39)

which directly yields the third result.
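The choice of $\xi $ in (36) and the sign of the resulting descent coefficient can be checked numerically. The sketch below is only an illustration: the function names `choose_xi` and `descent_coefficient` are hypothetical, the larger root of the quadratic $\xi ^2-c\,\xi +1=0$ is picked arbitrarily (either root satisfies (36)), and $\tau \ge 1$ is assumed so that the division by $\tau $ is defined.

```python
import math


def choose_xi(gamma, L, tau):
    """Return a root xi > 0 of xi + 1/xi = 1 + (1/tau) * (1/(gamma*L) - 1/2), as in (36)."""
    if tau < 1:
        raise ValueError("this sketch assumes a delay bound tau >= 1")
    if not 0 < gamma < 2.0 / ((2 * tau + 1) * L):
        raise ValueError("step size must lie in (0, 2 / ((2*tau + 1) * L))")
    c = 1.0 + (1.0 / (gamma * L) - 0.5) / tau   # right-hand side of (36); c > 2 for valid gamma
    # xi + 1/xi = c  <=>  xi^2 - c*xi + 1 = 0; take the larger root.
    return (c + math.sqrt(c * c - 4.0)) / 2.0


def descent_coefficient(gamma, L, tau, xi):
    """Coefficient of ||x^{k+1} - x^k||^2 in the first inequality of (37)."""
    return 1.0 / gamma - (tau * xi + 1.0) * L / 2.0 - L * tau / (2.0 * xi)


if __name__ == "__main__":
    gamma, L, tau = 0.01, 10.0, 3
    xi = choose_xi(gamma, L, tau)
    coef = descent_coefficient(gamma, L, tau, xi)
    lower = 0.25 * (1.0 / gamma - L / 2.0 - tau * L)  # constant appearing in (37)
    print(xi, coef, coef >= lower)
```

For any step size in the admissible range the coefficient is positive and no smaller than the constant $\frac{1}{4}\left( \frac{1}{\gamma }-\frac{L}{2}-\tau L\right) $ in (37), which is what makes the Lyapunov function $\varGamma _k(\xi )$ nonincreasing.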

Proof of Theorem 1

By the definition of the subdifferential, we have

$$\begin{aligned} \frac{x^{k}-x^{k+1}}{\gamma }-v^{k} \in \partial w^k \tilde{g} \left( x^{k+1} \right) . \end{aligned}$$

(40)

That means

$$\begin{aligned} \begin{aligned} \frac{x^{k}-x^{k+1}}{\gamma }+\nabla F \left( x^{k+1}\right) -v^{k} \in \nabla F \left( x^{k+1}\right) +\partial H \left( x^{k+1}\right) =\partial \varPhi \left( x^{k+1}\right) , \end{aligned} \end{aligned}$$

(41)

where $H(x):=\sum _{j=1}^{n} h\left( g\left( x_{j}\right) \right) $. Thus we have

$$\begin{aligned} \begin{aligned} {\text {dist}}^2 \left( \mathbf {0}, \partial \varPhi \left( x^{k+1}\right) \right)&\le \left\| \frac{x^{k}-x^{k+1}}{\gamma }+\nabla F \left( x^{k+1}\right) -v^k\right\| _2^2\\&\le \frac{2\left\| x^{k+1}-x^k \right\| _2^2}{\gamma ^2}+2L^2(\tau +1) \sum _{d=k-\tau }^{k} \left\| x^{d+1}-x^d\right\| _2^2. \end{aligned} \end{aligned}$$

(42)

Combining this with Lemma 1, we obtain

$$\begin{aligned} \sum _{k}{\text {dist}}^2\left( \mathbf {0}, \partial \varPhi \left( x^{k+1}\right) \right) <+\infty . \end{aligned}$$

(43)

Applying [Lemma 3, [9]] once more proves the result.
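The second inequality in (42) compresses two standard steps. A sketch of the intermediate bounds, assuming as in (32) that $v^{k}=\sum _{i=1}^{m}\nabla f_i\left( x^{k-\tau _{i,k}}\right) $ and that each $\nabla f_i$ is $L_i$-Lipschitz, with the shorthand $a:=\frac{x^{k}-x^{k+1}}{\gamma }$ and $b:=\nabla F\left( x^{k+1}\right) -v^{k}$ introduced here only for illustration:

$$\begin{aligned} \begin{aligned} \left\| a+b\right\| _2^2&\le 2\left\| a\right\| _2^2+2\left\| b\right\| _2^2,\\ \left\| b\right\| _2&\le \sum _{i=1}^{m} L_i \left\| x^{k+1}-x^{k-\tau _{i,k}}\right\| _2 \le L \sum _{d=k-\tau }^{k} \left\| x^{d+1}-x^d\right\| _2. \end{aligned} \end{aligned}$$

Squaring the second bound and applying the Cauchy-Schwarz inequality to the sum then gives an estimate of the form (42), with a constant proportional to the number of terms in the sum.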

Cite this paper

Deng, X., Sun, T., Liu, F., Huang, F. (2020). PRIAG: Proximal Reweighted Incremental Aggregated Gradient Algorithm for Distributed Optimizations. In: Qiu, M. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2020. Lecture Notes in Computer Science, vol 12452. Springer, Cham. https://doi.org/10.1007/978-3-030-60245-1_34

DOI: https://doi.org/10.1007/978-3-030-60245-1_34
Published: 29 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60244-4
Online ISBN: 978-3-030-60245-1
eBook Packages: Mathematics and Statistics (R0)
