
Block Mirror Stochastic Gradient Method For Stochastic Optimization


Abstract

In this paper, a block mirror stochastic gradient method is developed to solve stochastic optimization problems, both convex and nonconvex, in which the feasible set and the variables are partitioned into multiple blocks. The proposed method combines features of the classic stochastic mirror descent method and the block coordinate gradient descent method. Acquiring stochastic gradient information from stochastic oracles, our method updates all blocks of variables in a Gauss–Seidel fashion. We establish convergence for both the convex and the nonconvex case. The analysis is challenging because the typical unbiasedness assumption on the stochastic gradient fails to hold under Gauss–Seidel updates, so more specific assumptions are required. The proposed algorithm is tested on the conditional value-at-risk problem and the stochastic LASSO problem to demonstrate its efficiency.
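To make the update rule concrete, the following is a minimal runnable sketch of one possible reading of the method in Python. It assumes Euclidean distance-generating functions \(\omega _i(\textbf{x}_i)=\tfrac{1}{2}\Vert \textbf{x}_i\Vert ^2\), so each mirror step reduces to a proximal gradient step, and it uses a toy stochastic LASSO instance for illustration; the names block_mirror_sgd, stoch_grad, prox, gammas and batch_sizes are illustrative and are not the authors' notation or implementation.

import numpy as np

def block_mirror_sgd(x_blocks, stoch_grad, prox, gammas, batch_sizes, rng):
    # At iteration k, sweep the blocks in Gauss-Seidel order: block i is updated with a
    # mini-batch partial gradient evaluated at the current, partially updated point,
    # followed by a mirror step (here a Euclidean proximal step).
    for gamma, T_k in zip(gammas, batch_sizes):
        for i in range(len(x_blocks)):
            G_bar = np.mean([stoch_grad(i, x_blocks, rng) for _ in range(T_k)], axis=0)
            x_blocks[i] = prox(i, x_blocks[i] - gamma * G_bar, gamma)
    return x_blocks

# Toy stochastic LASSO: minimize E[(a^T x - b)^2 / 2] + lam * ||x||_1 over two blocks.
rng = np.random.default_rng(0)
n, lam = 10, 0.1
x_true = np.zeros(n); x_true[:3] = 1.0
blocks = [slice(0, 5), slice(5, 10)]

def stoch_grad(i, x_blocks, rng):
    a = rng.normal(size=n)
    b = a @ x_true + 0.01 * rng.normal()
    x = np.concatenate(x_blocks)
    return (a @ x - b) * a[blocks[i]]             # partial stochastic gradient for block i

def prox(i, v, gamma):                            # soft-thresholding: prox of gamma * lam * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)

K = 200
x = block_mirror_sgd([np.zeros(5), np.zeros(5)], stoch_grad, prox,
                     gammas=[0.05] * K, batch_sizes=[8] * K, rng=rng)
print(np.round(np.concatenate(x), 2))             # close to x_true = (1, 1, 1, 0, ..., 0)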


Notes

  1. Downloaded from http://www.resset.cn/.

  2. Downloaded from https://archive.ics.uci.edu/ml/datasets/.


Acknowledgements

The authors would like to thank Prof. Chengbo Yang for discussing the conditional value-at-risk problem in Sect. 4. The authors also sincerely thank the anonymous referees for their valuable comments and suggestions, which helped improve the manuscript significantly.

Funding

The research is partly supported by the National Key Research and Development Program of China (2020YFA0714101), NSFC (11701210, 11601183, 61872162, 12171199), the Education Department Project of Jilin Province (JJKH20211031KJ), the Science and Technology Department of Jilin Province (20180520212JH, 20190103029JH, 20200201269JC, 20210201015GX), and the Fundamental Research Funds for the Central Universities.

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Haiming Song or Xinxin Li.

Ethics declarations

Conflict of interest

All the authors declare that they have no financial interests.


Appendix A: Proofs of some lemmas in this paper

Proof of Lemma 1

From the strong convexity of \(\omega _{i}\), it holds that

$$\begin{aligned} \begin{aligned} \omega _{i}(\textbf{x}_{i})-\omega _{i}(\textbf{y}_{i}) \ge&\langle \nabla \omega _{i}(\textbf{y}_{i}),\textbf{x}_{i}-\textbf{y}_{i}\rangle +\frac{\alpha _{i}}{2}\Vert \textbf{y}_{i}-\textbf{x}_{i}\Vert _{\mathcal {E}_i}^{2} \\ \ge&-\frac{1}{2\alpha _{i}}\left\| \nabla \omega _{i}(\textbf{y}_{i}) \right\| _{\mathcal {E}_i,*}^{2}, \end{aligned} \end{aligned}$$

where the second inequality follows from (6). This implies

$$\begin{aligned} \omega _{i}(\textbf{y}_{i})-\omega _{i}(\textbf{x}_{i}) \le \frac{1}{2\alpha _{i}}\left\| \nabla \omega _{i}(\textbf{y}_{i}) \right\| _{\mathcal {E}_i,*}^{2}. \end{aligned}$$

Let us consider the function \(\varphi _{\textbf{x}_{i}}(\textbf{z})=\omega _{i}(\textbf{z}) -\langle \nabla \omega _{i}(\textbf{x}_{i}),\textbf{z}\rangle \). It is easy to see that \(\varphi _{\textbf{x}_{i}}(\textbf{z})\) is strongly convex with the same parameter \(\alpha _{i}\), since

$$\begin{aligned} \begin{aligned} \langle \nabla \varphi _{\textbf{x}_{i}}(\textbf{z}_{1}) -\nabla \varphi _{\textbf{x}_{i}}(\textbf{z}_{2}),\textbf{z}_{1}-\textbf{z}_{2}\rangle =&\langle \nabla \omega _{i}(\textbf{z}_{1})-\nabla \omega _{i}(\textbf{z}_{2}),\textbf{z}_{1}-\textbf{z}_{2}\rangle \\ \ge&\alpha _{i}\Vert \textbf{z}_{1}-\textbf{z}_{2}\Vert _{\mathcal {E}_i}^{2}. \end{aligned} \end{aligned}$$

Applying the inequality above to \(\varphi _{\textbf{x}_{i}}\), we have

$$\begin{aligned} \begin{aligned} \omega _{i}(\textbf{y}_{i})-\omega _{i}(\textbf{x}_{i})-\langle \nabla \omega _{i}(\textbf{x}_{i}),\textbf{y}_{i}-\textbf{x}_{i}\rangle =&\varphi _{\textbf{x}_{i}}(\textbf{y}_{i})-\varphi _{\textbf{x}_{i}}(\textbf{x}_{i}) \\ \le&\frac{1}{2\alpha _{i}}\left\| \nabla \varphi _{\textbf{x}_{i}}(\textbf{y}_{i}) \right\| _{\mathcal {E}_i,*}^{2}\\ =&\frac{1}{2\alpha _{i}} \left\| \nabla \omega _{i}(\textbf{y}_{i})-\nabla \omega _{i}(\textbf{x}_{i}) \right\| _{\mathcal {E}_i,*}^{2}, \end{aligned} \end{aligned}$$

which gives the result. \(\square \)
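The bound of Lemma 1 can be checked numerically. The snippet below is only a sanity check under additional assumptions, not part of the proof: it takes Euclidean norms (so \(\Vert \cdot \Vert _{\mathcal {E}_i,*}=\Vert \cdot \Vert _2\)) and \(\omega _i(\textbf{x})=\tfrac{1}{2}\textbf{x}^{\top }A\textbf{x}\) for a random positive definite \(A\), whose smallest eigenvalue plays the role of \(\alpha _i\).

import numpy as np

rng = np.random.default_rng(1)
n = 5
B = rng.normal(size=(n, n))
A = B @ B.T + np.eye(n)                  # positive definite Hessian of omega
alpha = np.linalg.eigvalsh(A).min()      # strong-convexity modulus w.r.t. ||.||_2

omega = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(1000):
    x, y = rng.normal(size=n), rng.normal(size=n)
    lhs = omega(y) - omega(x) - grad(x) @ (y - x)
    rhs = np.linalg.norm(grad(y) - grad(x)) ** 2 / (2 * alpha)
    assert lhs <= rhs + 1e-9
print("Lemma 1 bound holds on all sampled pairs")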

Proof of Lemma 2

Note that, for any \(T \ge 2\), \(\sum _{t=1}^{T-1}\left[ \textbf{G}_{i}(\textbf{x}^{k},\xi _{t}^{k}) - \textbf{g}_{i}(\textbf{x}^{k})\right] \) is determined by \(\{\xi _{t}^{k}\}_{t=1}^{T-1}\), whereas \(\textbf{G}_{i}(\textbf{x}^{k},\xi _{T}^{k}) - \textbf{g}_{i}(\textbf{x}^{k})\) is independent of \(\{\xi _{t}^{k}\}_{t=1}^{T-1}\). This, together with (2), yields that

$$\begin{aligned} {\mathbb {E}}\left[ \langle \sum _{t=1}^{T-1} \left[ \textbf{G}_{i}(\textbf{x}^{k},\xi _{t}^{k}) -\textbf{g}_{i}(\textbf{x}^{k})\right] , \textbf{G}_{i}(\textbf{x}^{k},\xi _{T}^{k}) - \textbf{g}_{i}(\textbf{x}^{k}) \rangle | \{\xi _{t}^{k}\}_{t=1}^{T-1} \right] = 0,\quad \forall ~T\ge 2. \end{aligned}$$

Then, for the Euclidean norm \(\Vert \cdot \Vert \), applying this identity recursively yields

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left[ \left\| \sum _{t=1}^{T_{k}}\left[ \textbf{G}_{i}(\textbf{x}^{k},\xi _{t}^{k}) - \textbf{g}_{i}(\textbf{x}^{k})\right] \right\| ^{2}\right] =&{\mathbb {E}}\left[ \left\| \sum _{t=1}^{T_{k}-1}\left[ \textbf{G}_{i}(\textbf{x}^{k},\xi _{t}^{k}) - \textbf{g}_{i}(\textbf{x}^{k})\right] \right\| ^{2} \right] \\&+{\mathbb {E}}\left[ \left\| \textbf{G}_{i}(\textbf{x}^{k},\xi _{T_{k}}^{k}) - \textbf{g}_{i}(\textbf{x}^{k}) \right\| ^{2} \right] \\ =&\sum _{t=1}^{T_{k}}{\mathbb {E}}\left[ \left\| \textbf{G}_{i}(\textbf{x}^{k},\xi _{t}^{k}) - \textbf{g}_{i}(\textbf{x}^{k}) \right\| ^{2} \right] . \end{aligned} \end{aligned}$$

Under Assumption 1, this implies that

$$\begin{aligned} {\mathbb {E}}\left[ \left\| \bar{\textbf{G}}_{i}^{k} - \textbf{g}_{i}(\textbf{x}^{k}) \right\| ^{2} \right] =\frac{1}{T_{k}^2} {\mathbb {E}}\left[ \left\| \sum _{t=1}^{T_{k}}\left[ \textbf{G}_{i}(\textbf{x}^{k},\xi _{t}^{k}) - \textbf{g}_{i}(\textbf{x}^{k})\right] \right\| ^{2} \right] \le \frac{\sigma _{i}^{2}}{T_{k}}. \end{aligned}$$

The equivalence of the norms \(\Vert \cdot \Vert _{\mathcal {E}_i,*}\) and \(\Vert \cdot \Vert \) on \({\mathbb {R}}^{n_i}\) then completes the proof. \( \square \)
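As an illustration of the bound \({\mathbb {E}}[ \Vert \bar{\textbf{G}}_{i}^{k} - \textbf{g}_{i}(\textbf{x}^{k}) \Vert ^{2} ] \le \sigma _{i}^{2}/T_{k}\), the following Monte Carlo sketch averages \(T\) unbiased Gaussian gradient estimates and compares the empirical mean-squared deviation with \(\sigma ^{2}/T\); the Gaussian noise model and all names are illustrative assumptions, not the paper's stochastic oracle.

import numpy as np

rng = np.random.default_rng(2)
n, T, reps = 4, 16, 20000
g = rng.normal(size=n)                      # plays the role of the exact gradient g_i(x^k)
noise_std = 0.5
sigma2 = n * noise_std ** 2                 # E||G - g||^2 for a single sample under this model

dev2 = np.empty(reps)
for r in range(reps):
    G = g + noise_std * rng.normal(size=(T, n))   # T i.i.d. unbiased estimates G_i(x^k, xi_t^k)
    dev2[r] = np.linalg.norm(G.mean(axis=0) - g) ** 2
print(f"empirical E||Gbar - g||^2 = {dev2.mean():.4f}, bound sigma^2/T = {sigma2 / T:.4f}")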

Proof of Lemma 3

By Remark 2, it can be seen from (10), (11), and (14) that for any i and k

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left[ \left\| \textbf{G}_{i}(\textbf{x}^{k},\xi ) \right\| _{\mathcal {E}_i,*}^{2} \right] \le&2\left\| \textbf{g}_{i}(\textbf{x}^{k}) \right\| _{\mathcal {E}_i,*}^{2} +2{\mathbb {E}}\left[ \left\| \textbf{G}_{i}(\textbf{x}^{k},\xi ) -\textbf{g}_{i}(\textbf{x}^{k}) \right\| _{\mathcal {E}_i,*}^{2} \right] \\ \le&2\left( 2\left\| \textbf{g}_{i}(\textbf{x}^{k})-\textbf{g}_{i}(\textbf{x}^{1}) \right\| _{\mathcal {E}_i,*}^{2} +2\Vert \textbf{g}_{i}(\textbf{x}^{1})\Vert _{\mathcal {E}_i,*}^{2} \right) +2\sigma _{i}^{2}\\ \le&4L_{i}^{2}\Vert \textbf{x}^{k}-\textbf{x}^{1}\Vert _{\mathcal {E}_i}^{2} +4\left\| \textbf{g}_{i}(\textbf{x}^{1}) \right\| _{\mathcal {E}_i,*}^{2}+2\sigma _{i}^{2}\\ \le&16L_{i}^{2}\rho ^{2}+4\left\| \textbf{g}_{i}(\textbf{x}^{1}) \right\| _{\mathcal {E}_i,*}^{2} +2\sigma _{i}^{2}\\ \le&M_{i}^{2} + 2\sigma _{i}^{2}. \end{aligned} \end{aligned}$$

Also it holds by Lemma 2 that

$$\begin{aligned} {\mathbb {E}}\left[ \Vert \bar{\textbf{G}}_{i}^{k}\Vert _{\mathcal {E}_i,*}^{2} \right] \le 2\left\| \textbf{g}_{i}(\textbf{x}^{k}) \right\| _{\mathcal {E}_i,*}^{2} +\frac{2c_{i}^{2}\sigma _{i}^{2}}{T_{k}} \le M_{i}^{2} + \frac{2c_{i}^{2}\sigma _{i}^{2}}{T_{k}}. \end{aligned}$$

This completes the proof of Lemma 3. \(\square \)

Proof of Lemma 4

By (8) and the optimality of \(\textbf{x}_{i}^{k+1}\), we have

$$\begin{aligned} \langle \tilde{\textbf{G}}_{i}^{k}+\frac{1}{\gamma _{i}^{k}}(\nabla \omega _{i}(\textbf{x}_{i}^{k+1}) -\nabla \omega _{i}(\textbf{x}_{i}^{k})),\textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}\rangle +r_{i}(\textbf{x}_{i}^{k})-r_{i}(\textbf{x}_{i}^{k+1})\ge 0. \end{aligned}$$

Because of the strong convexity of \(\omega _{i}\), we obtain

$$\begin{aligned} \begin{aligned} \langle \tilde{\textbf{G}}_{i}^{k},\textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}\rangle +r_{i}(\textbf{x}_{i}^{k})-r_{i}(\textbf{x}_{i}^{k+1}) \ge&\frac{1}{\gamma _{i}^{k}}\langle \nabla \omega _{i}(\textbf{x}_{i}^{k}) -\nabla \omega _{i}(\textbf{x}_{i}^{k+1}),\textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}\rangle \\ \ge&\frac{\alpha _{i}}{\gamma _{i}^{k}}\Vert \textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}\Vert _{\mathcal {E}_i}^{2}. \end{aligned} \end{aligned}$$

On the other hand, by Assumption 3 it holds that

$$\begin{aligned} \langle \tilde{\textbf{G}}_{i}^{k},\textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}\rangle +r_{i}(\textbf{x}_{i}^{k})-r_{i}(\textbf{x}_{i}^{k+1}) \le \left( \Vert \tilde{\textbf{G}}_{i}^{k}\Vert _{\mathcal {E}_i,*}+L_{r_i} \right) \Vert \textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}\Vert _{\mathcal {E}_i}. \end{aligned}$$

Then we obtain

$$\begin{aligned} \Vert \textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}\Vert _{\mathcal {E}_i} \le \frac{\gamma _{i}^{k}}{\alpha _{i}} \left( \Vert \tilde{\textbf{G}}_{i}^{k}\Vert _{\mathcal {E}_i,*}+L_{r_i} \right) . \end{aligned}$$

In addition, under Assumption 2, for any \(\xi _{t}^{k} \in \Xi ^{k} \subseteq \Xi \), it holds that

$$\begin{aligned} \begin{aligned} \left\| \textbf{G}_{i}(\textbf{x}_{<i}^{k+1},\textbf{x}_{\ge i}^{k},\xi _{t}^{k}) - \textbf{G}_{i}(\textbf{x}^{k},\xi _{t}^{k}) \right\| _{\mathcal {E}_i,*}^{2}&\le L_{i}(\xi _{t}^{k})^{2}\left( \sum _{j<i}\Vert \textbf{x}_{j}^{k+1}-\textbf{x}_{j}^{k}\Vert _{\mathcal {E}_{j}}^{2}\right) \\&\le 2L_{i}(\xi _{t}^{k})^{2}\left( \sum _{j<i}\left( \frac{\gamma _{j}^{k}}{\alpha _{j}}\right) ^{2} \left( \Vert \tilde{\textbf{G}}_{j}^{k}\Vert _{\mathcal {E}_j,*}^{2}+L_{r_j}^{2}\right) \right) \\&= 2\sum _{j<i}\left( \frac{\gamma _{j}^{k}}{\alpha _{j}}\right) ^{2}\left( L_{i}(\xi _{t}^{k})^{2}\Vert \tilde{\textbf{G}}_{j}^{k}\Vert _{\mathcal {E}_j,*}^{2} +L_{i}(\xi _{t}^{k})^{2}L_{r_j}^{2}\right) . \\ \end{aligned} \end{aligned}$$

Note that

$$\begin{aligned} L_{i}(\xi _{t}^{k})^{2}\Vert \tilde{\textbf{G}}_{j}^{k}\Vert _{\mathcal {E}_j,*}^{2}\le \frac{1}{T_k}\sum _{s=1}^{T_k} L_{i}(\xi _{t}^{k})^{2}\left\| \textbf{G}_{j}(\textbf{x}_{<j}^{k+1},\textbf{x}_{\ge j}^{k},\xi _{s}^{k}) \right\| _{\mathcal {E}_j,*}^{2}, \end{aligned}$$
(68)

then by the uncorrelated condition, we have

$$\begin{aligned} {\mathbb {E}}\left[ L_{i}(\xi _{t}^{k})^{2}\Vert \tilde{\textbf{G}}_{j}^{k}\Vert _{\mathcal {E}_j,*}^{2} \right] \le \frac{1}{T_k}\sum _{s=1}^{T_k}L_{i}^{2}{\mathbb {E}}\left[ \left\| \textbf{G}_{j}(\textbf{x}_{<j}^{k+1},\textbf{x}_{\ge j}^{k},\xi _{s}^{k}) \right\| _{\mathcal {E}_j,*}^{2} \right] . \end{aligned}$$
(69)

Using the above observation and introducing the auxiliary quantity \(\Lambda \) defined by

$$\begin{aligned} \Lambda = \max _{j} {\mathbb {E}}\left[ \left\| \textbf{G}_{j}(\textbf{x}_{<j}^{k+1},\textbf{x}_{\ge j}^{k},\xi ) \right\| _{\mathcal {E}_j,*}^{2}\right] , \end{aligned}$$

we get that

$$\begin{aligned} {\mathbb {E}}\left[ \left\| \textbf{G}_{i}(\textbf{x}_{<i}^{k+1},\textbf{x}_{\ge i}^{k},\xi _{t}^{k}) - \textbf{G}_{i}(\textbf{x}^{k},\xi _{t}^{k}) \right\| _{\mathcal {E}_i,*}^{2} \right] \le 2mL^{2}\left( \Lambda +L_{r}^{2}\right) \left( \frac{\gamma _{max}^{k}}{\alpha }\right) ^{2}. \end{aligned}$$
(70)

Also, observe that

$$\begin{aligned} \left\| \textbf{G}_{i}(\textbf{x}_{<i}^{k+1},\textbf{x}_{\ge i}^{k},\xi _{t}^{k}) \right\| _{\mathcal {E}_i,*}^{2} \le 2\left\| \textbf{G}_{i}(\textbf{x}_{<i}^{k+1},\textbf{x}_{\ge i}^{k},\xi _{t}^{k}) - \textbf{G}_{i}(\textbf{x}^{k},\xi _{t}^{k}) \right\| _{\mathcal {E}_i,*}^{2} + 2\left\| \textbf{G}_{i}(\textbf{x}^{k},\xi _{t}^{k}) \right\| _{\mathcal {E}_i,*}^{2}, \end{aligned}$$

and by the same argument as in the proof of Lemma 3, together with the definition of \(\Lambda \), it follows that

$$\begin{aligned} \Lambda \le \frac{4mL^{2}L_{r}^{2}(\gamma _{max}^{k})^{2} + 2\alpha ^{2}(M^{2}+2\sigma ^{2})}{\alpha ^{2}-4mL^{2}(\gamma _{max}^{k})^{2}} = \delta (\gamma _{max}^{k}). \end{aligned}$$
(71)

Moreover, by Assumption 2 it is clear that

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}\left[ \Vert \tilde{\textbf{G}}_{i}^{k}-\bar{\textbf{G}}_{i}^{k}\Vert _{\mathcal {E}_i,*}^{2}\right] \le \frac{1}{T_k}{\mathbb {E}}\left[ \sum _{t=1}^{T_k} L_{i}(\xi _{t}^{k})^{2}\left( \sum _{j<i}\Vert \textbf{x}_{j}^{k+1}-\textbf{x}_{j}^{k}\Vert _{\mathcal {E}_{j}}^{2}\right) \right] \\&\quad \le \frac{1}{T_k}{\mathbb {E}}\left[ 2\sum _{t=1}^{T_k} L_{i}(\xi _{t}^{k})^{2}\left( \sum _{j<i}\left( \frac{\gamma _{j}^{k}}{\alpha _{j}}\right) ^{2} \left( \Vert \tilde{\textbf{G}}_{j}^{k}\Vert _{\mathcal {E}_j,*}^{2}+L_{r_j}^{2}\right) \right) \right] \\&\quad \le \frac{1}{T_k}{\mathbb {E}}\left[ 2\sum _{t=1}^{T_k} \left( \sum _{j<i}\left( \frac{\gamma _{j}^{k}}{\alpha _{j}}\right) ^{2} \left( \frac{1}{T_k}\sum _{s=1}^{T_k} L_{i}(\xi _{t}^{k})^{2}\Vert \textbf{G}_{j}(\textbf{x}_{<j}^{k+1},\textbf{x}_{\ge j}^{k},\xi _{s}^{k})\Vert _{\mathcal {E}_j,*}^{2}\right) \right) \right] \\&\qquad +\frac{1}{T_k}{\mathbb {E}}\left[ 2\sum _{t=1}^{T_k} \left( \sum _{j<i}L_{i}(\xi _{t}^{k})^{2}L_{r_j}^{2} \left( \frac{\gamma _{j}^{k}}{\alpha _{j}}\right) ^{2}\right) \right] \\&\quad \le 2mL_{i}^{2}\left( \delta (\gamma _{max}^{k}) + L_{r}^{2}\right) \left( \frac{\gamma _{max}^{k}}{\alpha }\right) ^{2}, \end{aligned} \end{aligned}$$
(72)

which, together with Lemma 3, gives that

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left[ \Vert \tilde{\textbf{G}}_{i}^{k}\Vert _{\mathcal {E}_i,*}^{2} \right]&\le 2{\mathbb {E}}\left[ \Vert \tilde{\textbf{G}}_{i}^{k}-\bar{\textbf{G}}_{i}^{k}\Vert _{\mathcal {E}_i,*}^{2} \right] +2{\mathbb {E}}\left[ \Vert \bar{\textbf{G}}_{i}^{k}\Vert _{\mathcal {E}_i,*}^{2} \right] \\&\le 4mL_{i}^{2}\left( \delta (\gamma _{max}^{k}) + L_{r}^{2}\right) \left( \frac{\gamma _{max}^{k}}{\alpha }\right) ^{2} + 2\left( M_{i}^{2} + \frac{2c_{i}^{2}\sigma _{i}^{2}}{T_{k}}\right) \\&= \varepsilon _{i}(\gamma _{max}^{k},T_{k}). \end{aligned} \end{aligned}$$

Furthermore, by the above observation, we conclude that

$$\begin{aligned} {\mathbb {E}}\left[ \Vert \textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}\Vert _{\mathcal {E}_i}^{2} \right] \le 2\left( \varepsilon _{i}(\gamma _{max}^{k},T_{k}) + L_{r_i}^{2}\right) \left( \frac{\gamma _{i}^{k}}{\alpha _{i}}\right) ^{2}, \end{aligned}$$
(73)

completing the proof. \(\square \)
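The intermediate step-length estimate \(\Vert \textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}\Vert _{\mathcal {E}_i} \le \frac{\gamma _{i}^{k}}{\alpha _{i}}( \Vert \tilde{\textbf{G}}_{i}^{k}\Vert _{\mathcal {E}_i,*}+L_{r_i})\) used above can be checked numerically in the Euclidean setting \(\omega _i=\tfrac{1}{2}\Vert \cdot \Vert _2^2\) (so \(\alpha _i=1\)) with \(r_i(\textbf{x})=\lambda \Vert \textbf{x}\Vert _1\), whose Lipschitz constant with respect to \(\Vert \cdot \Vert _2\) is \(\lambda \sqrt{n}\). This is an illustrative sketch under those assumptions only.

import numpy as np

rng = np.random.default_rng(4)
n, lam, gamma = 6, 0.3, 0.2
L_r = lam * np.sqrt(n)                        # Lipschitz constant of r(x) = lam * ||x||_1 w.r.t. ||.||_2
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

for _ in range(1000):
    x = rng.normal(size=n)
    G = rng.normal(size=n)                    # plays the role of the stochastic gradient
    x_new = soft(x - gamma * G, gamma * lam)  # mirror step with omega = 0.5 * ||.||^2
    assert np.linalg.norm(x - x_new) <= gamma * (np.linalg.norm(G) + L_r) + 1e-9
print("step-length bound holds on all samples")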

Proof of Lemma 7

Let \(\textbf{x}_{\textbf{g}_{i}}^{k+1} =\textbf{x}_{i}^{k}-\gamma _{i}^{k}\mathcal {G}_{i}\left( \textbf{x}_{i}^{k},\textbf{g}(\textbf{x}^{k}),\gamma _{i}^{k} \right) =P_{i}\left( \textbf{x}_{i}^{k},\textbf{g}(\textbf{x}^{k}),\gamma _{i}^{k} \right) \). By the optimality condition of (48), and the definition of \(\textbf{x}_{i}^{k+1}\) and \(\textbf{x}_{\textbf{g}_{i}}^{k+1}\), we have

$$\begin{aligned} \left\langle \tilde{\textbf{G}}_{i}^{k}+\frac{1}{\gamma _{i}^{k}}\left( \nabla \omega _{i}(\textbf{x}_{i}^{k+1}) -\nabla \omega _{i}(\textbf{x}_{i}^{k}) \right) ,\textbf{x}_{\textbf{g}_{i}}^{k+1}-\textbf{x}_{i}^{k+1}\right\rangle +r_{i}(\textbf{x}_{\textbf{g}_{i}}^{k+1})-r_{i}(\textbf{x}_{i}^{k+1})\ge 0, \end{aligned}$$

and

$$\begin{aligned} \langle \textbf{g}_{i}(\textbf{x}^{k})+\frac{1}{\gamma _{i}^{k}}\left( \nabla \omega _{i}(\textbf{x}_{\textbf{g}_{i}}^{k+1}) -\nabla \omega _{i}(\textbf{x}_{i}^{k}) \right) ,\textbf{x}_{i}^{k+1}-\textbf{x}_{\textbf{g}_{i}}^{k+1}\rangle +r_{i}(\textbf{x}_{i}^{k+1})-r_{i}(\textbf{x}_{\textbf{g}_{i}}^{k+1})\ge 0, \end{aligned}$$

respectively. Summing up the above two inequalities, one has

$$\begin{aligned} \begin{aligned} \langle \tilde{\textbf{G}}_{i}^{k}-\textbf{g}_{i}(\textbf{x}^{k}),\textbf{x}_{\textbf{g}_{i}}^{k+1}-\textbf{x}_{i}^{k+1}\rangle \ge&\frac{1}{\gamma _{i}^{k}}\langle \nabla \omega _{i}(\textbf{x}_{\textbf{g}_{i}}^{k+1}) -\nabla \omega _{i}(\textbf{x}_{i}^{k+1}),\textbf{x}_{\textbf{g}_{i}}^{k+1}-\textbf{x}_{i}^{k+1}\rangle \\ \ge&\frac{\alpha _{i}}{\gamma _{i}^{k}}\Vert \textbf{x}_{\textbf{g}_{i}}^{k+1}-\textbf{x}_{i}^{k+1}\Vert _{\mathcal {E}_i}^{2}, \end{aligned} \end{aligned}$$

where the second inequality follows from the strong convexity of \(\omega _{i}\). Then, it holds that

$$\begin{aligned} \frac{1}{\gamma _{i}^{k}}\Vert \textbf{x}_{\textbf{g}_{i}}^{k+1}-\textbf{x}_{i}^{k+1}\Vert _{\mathcal {E}_i}\le \frac{1}{\alpha _{i}}\Vert \tilde{\textbf{G}}_{i}^{k}-\textbf{g}_{i}(\textbf{x}^{k})\Vert _{\mathcal {E}_i,*}. \end{aligned}$$

Using the above relation, we obtain

$$\begin{aligned} \begin{aligned} \left\| \mathcal {G}_{i}(\textbf{x}_{i}^{k},\textbf{g}(\textbf{x}^{k}),\gamma _{i}^{k}) -\frac{1}{\gamma _{i}^{k}}(\textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1})\right\| _{\mathcal {E}_i} =&\left\| \frac{1}{\gamma _{i}^{k}}(\textbf{x}_{i}^{k}-\textbf{x}_{\textbf{g}_{i}}^{k+1}) -\frac{1}{\gamma _{i}^{k}}(\textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}) \right\| _{\mathcal {E}_i} \\ \le&\frac{1}{\alpha _{i}}\Vert \tilde{\textbf{G}}_{i}^{k}-\textbf{g}_{i}(\textbf{x}^{k})\Vert _{\mathcal {E}_i,*}. \end{aligned} \end{aligned}$$

Hence, we have

$$\begin{aligned} \begin{aligned} \left\| \mathcal {G}_{i}(\textbf{x}_{i}^{k},\textbf{g}(\textbf{x}^{k}),\gamma _{i}^{k})\right\| _{\mathcal {E}_i}^{2} =&\left\| \mathcal {G}_{i}(\textbf{x}_{i}^{k},\textbf{g}(\textbf{x}^{k}),\gamma _{i}^{k}) -\frac{1}{\gamma _{i}^{k}}(\textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}) +\frac{1}{\gamma _{i}^{k}}(\textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1})\right\| _{\mathcal {E}_i}^{2} \\ \le&2\left( \frac{1}{\alpha _{i}^{2}}\Vert \tilde{\textbf{G}}_{i}^{k}-\textbf{g}_{i}(\textbf{x}^{k})\Vert _{\mathcal {E}_i,*}^{2} +\left\| \frac{1}{\gamma _{i}^{k}}(\textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1})\right\| _{\mathcal {E}_i}^{2}\right) , \end{aligned} \end{aligned}$$

and we obtain the result as required in (51). \(\square \)
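In the Euclidean, single-block setting, \(P_{i}\) reduces to the proximal operator of \(\gamma _{i}^{k} r_{i}\) and \(\alpha _{i}=1\), so the argument above says that the gradient mapping computed from the stochastic estimate deviates from the exact one by at most \(\Vert \tilde{\textbf{G}}_{i}^{k}-\textbf{g}_{i}(\textbf{x}^{k})\Vert \). The snippet below is an illustrative numerical check of this perturbation bound with \(r_{i}=\lambda \Vert \cdot \Vert _1\); it is a sketch under these simplifying assumptions, not the authors' implementation.

import numpy as np

rng = np.random.default_rng(3)
n, lam, gamma = 6, 0.3, 0.1
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)   # prox of t * ||.||_1

for _ in range(1000):
    x = rng.normal(size=n)
    g = rng.normal(size=n)                     # exact gradient g_i(x^k)
    G = g + rng.normal(size=n)                 # stochastic estimate of the gradient
    x_g = soft(x - gamma * g, gamma * lam)     # exact prox point P_i(x_i^k, g, gamma)
    x_new = soft(x - gamma * G, gamma * lam)   # update computed from the noisy gradient
    lhs = np.linalg.norm((x - x_g) / gamma - (x - x_new) / gamma)
    assert lhs <= np.linalg.norm(G - g) + 1e-9  # (1/alpha) * ||G - g|| with alpha = 1
print("gradient-mapping perturbation bound holds on all samples")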


Cite this article

Yang, J., Song, H., Li, X. et al. Block Mirror Stochastic Gradient Method For Stochastic Optimization. J Sci Comput 94, 69 (2023). https://doi.org/10.1007/s10915-023-02110-y

