
Block Mirror Stochastic Gradient Method For Stochastic Optimization


Abstract

In this paper, a block mirror stochastic gradient method is developed to solve stochastic optimization problems, both convex and nonconvex, in which the feasible set and the variables are partitioned into multiple blocks. The proposed method combines features of the classic stochastic mirror descent method and the block coordinate gradient descent method. Acquiring stochastic gradient information from stochastic oracles, our method updates all blocks of variables in a Gauss–Seidel fashion. We establish convergence for both the convex and the nonconvex case. The analysis is challenging because the typical unbiasedness assumption on the stochastic gradient fails to hold under Gauss–Seidel updates, so more specific assumptions are required. The proposed algorithm is tested on the conditional value-at-risk problem and the stochastic LASSO problem to demonstrate its efficiency.
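To make the update rule concrete, the following is a minimal runnable sketch of one possible reading of the method in Python. It assumes Euclidean distance-generating functions \(\omega _i(\textbf{x}_i)=\tfrac{1}{2}\Vert \textbf{x}_i\Vert ^2\), so each mirror step reduces to a proximal gradient step, and it uses a toy stochastic LASSO instance for illustration; the names block_mirror_sgd, stoch_grad, prox, gammas and batch_sizes are illustrative and are not the authors' notation or implementation.

import numpy as np

def block_mirror_sgd(x_blocks, stoch_grad, prox, gammas, batch_sizes, rng):
    # At iteration k, sweep the blocks in Gauss-Seidel order: block i is updated with a
    # mini-batch partial gradient evaluated at the current, partially updated point,
    # followed by a mirror step (here a Euclidean proximal step).
    for gamma, T_k in zip(gammas, batch_sizes):
        for i in range(len(x_blocks)):
            G_bar = np.mean([stoch_grad(i, x_blocks, rng) for _ in range(T_k)], axis=0)
            x_blocks[i] = prox(i, x_blocks[i] - gamma * G_bar, gamma)
    return x_blocks

# Toy stochastic LASSO: minimize E[(a^T x - b)^2 / 2] + lam * ||x||_1 over two blocks.
rng = np.random.default_rng(0)
n, lam = 10, 0.1
x_true = np.zeros(n); x_true[:3] = 1.0
blocks = [slice(0, 5), slice(5, 10)]

def stoch_grad(i, x_blocks, rng):
    a = rng.normal(size=n)
    b = a @ x_true + 0.01 * rng.normal()
    x = np.concatenate(x_blocks)
    return (a @ x - b) * a[blocks[i]]             # partial stochastic gradient for block i

def prox(i, v, gamma):                            # soft-thresholding: prox of gamma * lam * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)

K = 200
x = block_mirror_sgd([np.zeros(5), np.zeros(5)], stoch_grad, prox,
                     gammas=[0.05] * K, batch_sizes=[8] * K, rng=rng)
print(np.round(np.concatenate(x), 2))             # close to x_true = (1, 1, 1, 0, ..., 0)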


Notes

  1. Downloaded from http://www.resset.cn/.

  2. Downloaded from https://archive.ics.uci.edu/ml/datasets/.


Acknowledgements

The authors would like to thank Prof. Chengbo Yang for discussing the conditional value-at-risk problem in Sect. 4. The authors also sincerely thank the anonymous referees for their valuable comments and suggestions, which helped improve the manuscript significantly.

Funding

The research is partly supported by the National Key Research and Development Program of China (2020YFA0714101), NSFC (11701210, 11601183, 61872162, 12171199), the Education Department Project of Jilin Province (JJKH20211031KJ), the Science and Technology Department of Jilin Province (20180520212JH, 20190103029JH, 20200201269JC, 20210201015GX), and the Fundamental Research Funds for the Central Universities.

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Haiming Song or Xinxin Li.

Ethics declarations

Conflict of interest

All the authors declare that they have no financial interests.


Appendix A: Proofs of some lemmas in this paper

Proof of Lemma 1

From the strong convexity of \(\omega _{i}\), it holds that

$$\begin{aligned} \begin{aligned} \omega _{i}(\textbf{x}_{i})-\omega _{i}(\textbf{y}_{i}) \ge&\langle \nabla \omega _{i}(\textbf{y}_{i}),\textbf{x}_{i}-\textbf{y}_{i}\rangle +\frac{\alpha _{i}}{2}\Vert \textbf{y}_{i}-\textbf{x}_{i}\Vert _{\mathcal {E}_i}^{2} \\ \ge&-\frac{1}{2\alpha _{i}}\left\| \nabla \omega _{i}(\textbf{y}_{i}) \right\| _{\mathcal {E}_i,*}^{2}, \end{aligned} \end{aligned}$$

where the second inequality follows from (6). This implies

$$\begin{aligned} \omega _{i}(\textbf{y}_{i})-\omega _{i}(\textbf{x}_{i}) \le \frac{1}{2\alpha _{i}}\left\| \nabla \omega _{i}(\textbf{y}_{i}) \right\| _{\mathcal {E}_i,*}^{2}. \end{aligned}$$

Let us consider the function \(\varphi _{\textbf{x}_{i}}(\textbf{z})=\omega _{i}(\textbf{z}) -\langle \nabla \omega _{i}(\textbf{x}_{i}),\textbf{z}\rangle \). It is easy to see that \(\varphi _{\textbf{x}_{i}}(\textbf{z})\) is strongly convex with the same parameter \(\alpha _{i}\), since

$$\begin{aligned} \begin{aligned} \langle \nabla \varphi _{\textbf{x}_{i}}(\textbf{z}_{1}) -\nabla \varphi _{\textbf{x}_{i}}(\textbf{z}_{2}),\textbf{z}_{1}-\textbf{z}_{2}\rangle =&\langle \nabla \omega _{i}(\textbf{z}_{1})-\nabla \omega _{i}(\textbf{z}_{2}),\textbf{z}_{1}-\textbf{z}_{2}\rangle \\ \ge&\alpha _{i}\Vert \textbf{z}_{1}-\textbf{z}_{2}\Vert _{\mathcal {E}_i}^{2}. \end{aligned} \end{aligned}$$

Applying the inequality above to \(\varphi _{\textbf{x}_{i}}\), we have

$$\begin{aligned} \begin{aligned} \omega _{i}(\textbf{y}_{i})-\omega _{i}(\textbf{x}_{i})-\langle \nabla \omega _{i}(\textbf{x}_{i}),\textbf{y}_{i}-\textbf{x}_{i}\rangle =&\varphi _{\textbf{x}_{i}}(\textbf{y}_{i})-\varphi _{\textbf{x}_{i}}(\textbf{x}_{i}) \\ \le&\frac{1}{2\alpha _{i}}\left\| \nabla \varphi _{\textbf{x}_{i}}(\textbf{y}_{i}) \right\| _{\mathcal {E}_i,*}^{2}\\ =&\frac{1}{2\alpha _{i}} \left\| \nabla \omega _{i}(\textbf{y}_{i})-\nabla \omega _{i}(\textbf{x}_{i}) \right\| _{\mathcal {E}_i,*}^{2}, \end{aligned} \end{aligned}$$

which gives the result. \(\square \)
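The bound of Lemma 1 can be checked numerically. The snippet below is only a sanity check under additional assumptions, not part of the proof: it takes Euclidean norms (so \(\Vert \cdot \Vert _{\mathcal {E}_i,*}=\Vert \cdot \Vert _2\)) and \(\omega _i(\textbf{x})=\tfrac{1}{2}\textbf{x}^{\top }A\textbf{x}\) for a random positive definite \(A\), whose smallest eigenvalue plays the role of \(\alpha _i\).

import numpy as np

rng = np.random.default_rng(1)
n = 5
B = rng.normal(size=(n, n))
A = B @ B.T + np.eye(n)                  # positive definite Hessian of omega
alpha = np.linalg.eigvalsh(A).min()      # strong-convexity modulus w.r.t. ||.||_2

omega = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(1000):
    x, y = rng.normal(size=n), rng.normal(size=n)
    lhs = omega(y) - omega(x) - grad(x) @ (y - x)
    rhs = np.linalg.norm(grad(y) - grad(x)) ** 2 / (2 * alpha)
    assert lhs <= rhs + 1e-9
print("Lemma 1 bound holds on all sampled pairs")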

Proof of Lemma 2

Note that, for any \(T \ge 2\), \(\sum _{t=1}^{T-1}\left[ \textbf{G}_{i}(\textbf{x}^{k},\xi _{t}^{k}) - \textbf{g}_{i}(\textbf{x}^{k})\right] \) is determined by \(\{\xi _{t}^{k}\}_{t=1}^{T-1}\), whereas \(\textbf{G}_{i}(\textbf{x}^{k},\xi _{T}^{k}) - \textbf{g}_{i}(\textbf{x}^{k})\) is independent of \(\{\xi _{t}^{k}\}_{t=1}^{T-1}\). This, together with (2), yields that

$$\begin{aligned} {\mathbb {E}}\left[ \langle \sum _{t=1}^{T-1} \left[ \textbf{G}_{i}(\textbf{x}^{k},\xi _{t}^{k}) -\textbf{g}_{i}(\textbf{x}^{k})\right] , \textbf{G}_{i}(\textbf{x}^{k},\xi _{T}^{k}) - \textbf{g}_{i}(\textbf{x}^{k}) \rangle | \{\xi _{t}^{k}\}_{t=1}^{T-1} \right] = 0,\quad \forall ~T\ge 2. \end{aligned}$$

Then, for the Euclidean norm \(\Vert \cdot \Vert \), applying this identity recursively yields

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left[ \left\| \sum _{t=1}^{T_{k}}\left[ \textbf{G}_{i}(\textbf{x}^{k},\xi _{t}^{k}) - \textbf{g}_{i}(\textbf{x}^{k})\right] \right\| ^{2}\right] =&{\mathbb {E}}\left[ \left\| \sum _{t=1}^{T_{k}-1}\left[ \textbf{G}_{i}(\textbf{x}^{k},\xi _{t}^{k}) - \textbf{g}_{i}(\textbf{x}^{k})\right] \right\| ^{2} \right] \\&+{\mathbb {E}}\left[ \left\| \textbf{G}_{i}(\textbf{x}^{k},\xi _{T_{k}}^{k}) - \textbf{g}_{i}(\textbf{x}^{k}) \right\| ^{2} \right] \\ =&\sum _{t=1}^{T_{k}}{\mathbb {E}}\left[ \left\| \textbf{G}_{i}(\textbf{x}^{k},\xi _{t}^{k}) - \textbf{g}_{i}(\textbf{x}^{k}) \right\| ^{2} \right] . \end{aligned} \end{aligned}$$

Under Assumption 1, this implies that

$$\begin{aligned} {\mathbb {E}}\left[ \left\| \bar{\textbf{G}}_{i}^{k} - \textbf{g}_{i}(\textbf{x}^{k}) \right\| ^{2} \right] =\frac{1}{T_{k}^2} {\mathbb {E}}\left[ \left\| \sum _{t=1}^{T_{k}}\left[ \textbf{G}_{i}(\textbf{x}^{k},\xi _{t}^{k}) - \textbf{g}_{i}(\textbf{x}^{k})\right] \right\| ^{2} \right] \le \frac{\sigma _{i}^{2}}{T_{k}}. \end{aligned}$$

The equivalence of the norms \(\Vert \cdot \Vert _{\mathcal {E}_i,*}\) and \(\Vert \cdot \Vert \) on \({\mathbb {R}}^{n_i}\) then completes the proof. \( \square \)
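As an illustration of the bound \({\mathbb {E}}[ \Vert \bar{\textbf{G}}_{i}^{k} - \textbf{g}_{i}(\textbf{x}^{k}) \Vert ^{2} ] \le \sigma _{i}^{2}/T_{k}\), the following Monte Carlo sketch averages \(T\) unbiased Gaussian gradient estimates and compares the empirical mean-squared deviation with \(\sigma ^{2}/T\); the Gaussian noise model and all names are illustrative assumptions, not the paper's stochastic oracle.

import numpy as np

rng = np.random.default_rng(2)
n, T, reps = 4, 16, 20000
g = rng.normal(size=n)                      # plays the role of the exact gradient g_i(x^k)
noise_std = 0.5
sigma2 = n * noise_std ** 2                 # E||G - g||^2 for a single sample under this model

dev2 = np.empty(reps)
for r in range(reps):
    G = g + noise_std * rng.normal(size=(T, n))   # T i.i.d. unbiased estimates G_i(x^k, xi_t^k)
    dev2[r] = np.linalg.norm(G.mean(axis=0) - g) ** 2
print(f"empirical E||Gbar - g||^2 = {dev2.mean():.4f}, bound sigma^2/T = {sigma2 / T:.4f}")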

Proof of Lemma 3

By Remark 2, it can be seen from (10), (11), and (14) that for any i and k

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left[ \left\| \textbf{G}_{i}(\textbf{x}^{k},\xi ) \right\| _{\mathcal {E}_i,*}^{2} \right] \le&2\left\| \textbf{g}_{i}(\textbf{x}^{k}) \right\| _{\mathcal {E}_i,*}^{2} +2{\mathbb {E}}\left[ \left\| \textbf{G}_{i}(\textbf{x}^{k},\xi ) -\textbf{g}_{i}(\textbf{x}^{k}) \right\| _{\mathcal {E}_i,*}^{2} \right] \\ \le&2\left( 2\left\| \textbf{g}_{i}(\textbf{x}^{k})-\textbf{g}_{i}(\textbf{x}^{1}) \right\| _{\mathcal {E}_i,*}^{2} +2\Vert \textbf{g}_{i}(\textbf{x}^{1})\Vert _{\mathcal {E}_i,*}^{2} \right) +2\sigma _{i}^{2}\\ \le&4L_{i}^{2}\Vert \textbf{x}^{k}-\textbf{x}^{1}\Vert _{\mathcal {E}_i}^{2} +4\left\| \textbf{g}_{i}(\textbf{x}^{1}) \right\| _{\mathcal {E}_i,*}^{2}+2\sigma _{i}^{2}\\ \le&16L_{i}^{2}\rho ^{2}+4\left\| \textbf{g}_{i}(\textbf{x}^{1}) \right\| _{\mathcal {E}_i,*}^{2} +2\sigma _{i}^{2}\\ \le&M_{i}^{2} + 2\sigma _{i}^{2}. \end{aligned} \end{aligned}$$

Also it holds by Lemma 2 that

$$\begin{aligned} {\mathbb {E}}\left[ \Vert \bar{\textbf{G}}_{i}^{k}\Vert _{\mathcal {E}_i,*}^{2} \right] \le 2\left\| \textbf{g}_{i}(\textbf{x}^{k}) \right\| _{\mathcal {E}_i,*}^{2} +\frac{2c_{i}^{2}\sigma _{i}^{2}}{T_{k}} \le M_{i}^{2} + \frac{2c_{i}^{2}\sigma _{i}^{2}}{T_{k}}. \end{aligned}$$

This completes the proof of Lemma 3. \(\square \)

Proof of Lemma 4

By (8) and the optimality of \(\textbf{x}_{i}^{k+1}\), we have

$$\begin{aligned} \langle \tilde{\textbf{G}}_{i}^{k}+\frac{1}{\gamma _{i}^{k}}(\nabla \omega _{i}(\textbf{x}_{i}^{k+1}) -\nabla \omega _{i}(\textbf{x}_{i}^{k})),\textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}\rangle +r_{i}(\textbf{x}_{i}^{k})-r_{i}(\textbf{x}_{i}^{k+1})\ge 0. \end{aligned}$$

Because of the strong convexity of \(\omega _{i}\), we obtain

$$\begin{aligned} \begin{aligned} \langle \tilde{\textbf{G}}_{i}^{k},\textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}\rangle +r_{i}(\textbf{x}_{i}^{k})-r_{i}(\textbf{x}_{i}^{k+1}) \ge&\frac{1}{\gamma _{i}^{k}}\langle \nabla \omega _{i}(\textbf{x}_{i}^{k}) -\nabla \omega _{i}(\textbf{x}_{i}^{k+1}),\textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}\rangle \\ \ge&\frac{\alpha _{i}}{\gamma _{i}^{k}}\Vert \textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}\Vert _{\mathcal {E}_i}^{2}. \end{aligned} \end{aligned}$$

On the other hand, by Assumption 3 it holds that

$$\begin{aligned} \langle \tilde{\textbf{G}}_{i}^{k},\textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}\rangle +r_{i}(\textbf{x}_{i}^{k})-r_{i}(\textbf{x}_{i}^{k+1}) \le \left( \Vert \tilde{\textbf{G}}_{i}^{k}\Vert _{\mathcal {E}_i,*}+L_{r_i} \right) \Vert \textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}\Vert _{\mathcal {E}_i}. \end{aligned}$$

Then we obtain

$$\begin{aligned} \Vert \textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}\Vert _{\mathcal {E}_i} \le \frac{\gamma _{i}^{k}}{\alpha _{i}} \left( \Vert \tilde{\textbf{G}}_{i}^{k}\Vert _{\mathcal {E}_i,*}+L_{r_i} \right) . \end{aligned}$$

In addition, under Assumption 2, for any \(\xi _{t}^{k} \in \Xi ^{k} \subseteq \Xi \), it holds that

$$\begin{aligned} \begin{aligned} \left\| \textbf{G}_{i}(\textbf{x}_{<i}^{k+1},\textbf{x}_{\ge i}^{k},\xi _{t}^{k}) - \textbf{G}_{i}(\textbf{x}^{k},\xi _{t}^{k}) \right\| _{\mathcal {E}_i,*}^{2}&\le L_{i}(\xi _{t}^{k})^{2}\left( \sum _{j<i}\Vert \textbf{x}_{j}^{k+1}-\textbf{x}_{j}^{k}\Vert _{\mathcal {E}_{j}}^{2}\right) \\&\le 2L_{i}(\xi _{t}^{k})^{2}\left( \sum _{j<i}\left( \frac{\gamma _{j}^{k}}{\alpha _{j}}\right) ^{2} \left( \Vert \tilde{\textbf{G}}_{j}^{k}\Vert _{\mathcal {E}_j,*}^{2}+L_{r_j}^{2}\right) \right) \\&= 2\sum _{j<i}\left( \frac{\gamma _{j}^{k}}{\alpha _{j}}\right) ^{2}\left( L_{i}(\xi _{t}^{k})^{2}\Vert \tilde{\textbf{G}}_{j}^{k}\Vert _{\mathcal {E}_j,*}^{2} +L_{i}(\xi _{t}^{k})^{2}L_{r_j}^{2}\right) . \\ \end{aligned} \end{aligned}$$

Note that

$$\begin{aligned} L_{i}(\xi _{t}^{k})^{2}\Vert \tilde{\textbf{G}}_{j}^{k}\Vert _{\mathcal {E}_j,*}^{2}\le \frac{1}{T_k}\sum _{s=1}^{T_k} L_{i}(\xi _{t}^{k})^{2}\left\| \textbf{G}_{j}(\textbf{x}_{<j}^{k+1},\textbf{x}_{\ge j}^{k},\xi _{s}^{k}) \right\| _{\mathcal {E}_j,*}^{2}, \end{aligned}$$
(68)

then by the uncorrelated condition, we have

$$\begin{aligned} {\mathbb {E}}\left[ L_{i}(\xi _{t}^{k})^{2}\Vert \tilde{\textbf{G}}_{j}^{k}\Vert _{\mathcal {E}_j,*}^{2} \right] \le \frac{1}{T_k}\sum _{s=1}^{T_k}L_{i}^{2}{\mathbb {E}}\left[ \left\| \textbf{G}_{j}(\textbf{x}_{<j}^{k+1},\textbf{x}_{\ge j}^{k},\xi _{s}^{k}) \right\| _{\mathcal {E}_j,*}^{2} \right] . \end{aligned}$$
(69)

Using the above observation and introducing the auxiliary quantity \(\Lambda \) defined by

$$\begin{aligned} \Lambda = \max _{j} {\mathbb {E}}\left[ \left\| \textbf{G}_{j}(\textbf{x}_{<j}^{k+1},\textbf{x}_{\ge j}^{k},\xi ) \right\| _{\mathcal {E}_j,*}^{2}\right] , \end{aligned}$$

we get that

$$\begin{aligned} {\mathbb {E}}\left[ \left\| \textbf{G}_{i}(\textbf{x}_{<i}^{k+1},\textbf{x}_{\ge i}^{k},\xi _{t}^{k}) - \textbf{G}_{i}(\textbf{x}^{k},\xi _{t}^{k}) \right\| _{\mathcal {E}_i,*}^{2} \right] \le 2mL^{2}\left( \Lambda +L_{r}^{2}\right) \left( \frac{\gamma _{max}^{k}}{\alpha }\right) ^{2}. \end{aligned}$$
(70)

Also, observe that

$$\begin{aligned} \left\| \textbf{G}_{i}(\textbf{x}_{<i}^{k+1},\textbf{x}_{\ge i}^{k},\xi _{t}^{k}) \right\| _{\mathcal {E}_i,*}^{2} \le 2\left\| \textbf{G}_{i}(\textbf{x}_{<i}^{k+1},\textbf{x}_{\ge i}^{k},\xi _{t}^{k}) - \textbf{G}_{i}(\textbf{x}^{k},\xi _{t}^{k}) \right\| _{\mathcal {E}_i,*}^{2} + 2\left\| \textbf{G}_{i}(\textbf{x}^{k},\xi _{t}^{k}) \right\| _{\mathcal {E}_i,*}^{2}, \end{aligned}$$

and by the same argument as in the proof of Lemma 3, together with the definition of \(\Lambda \), it follows that

$$\begin{aligned} \Lambda \le \frac{4mL^{2}L_{r}^{2}(\gamma _{max}^{k})^{2} + 2\alpha ^{2}(M^{2}+2\sigma ^{2})}{\alpha ^{2}-4mL^{2}(\gamma _{max}^{k})^{2}} = \delta (\gamma _{max}^{k}). \end{aligned}$$
(71)

Moreover, by Assumption 2 it is clear that

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}\left[ \Vert \tilde{\textbf{G}}_{i}^{k}-\bar{\textbf{G}}_{i}^{k}\Vert _{\mathcal {E}_i,*}^{2}\right] \le \frac{1}{T_k}{\mathbb {E}}\left[ \sum _{t=1}^{T_k} L_{i}(\xi _{t}^{k})^{2}\left( \sum _{j<i}\Vert \textbf{x}_{j}^{k+1}-\textbf{x}_{j}^{k}\Vert _{\mathcal {E}_{j}}^{2}\right) \right] \\&\quad \le \frac{1}{T_k}{\mathbb {E}}\left[ 2\sum _{t=1}^{T_k} L_{i}(\xi _{t}^{k})^{2}\left( \sum _{j<i}\left( \frac{\gamma _{j}^{k}}{\alpha _{j}}\right) ^{2} \left( \Vert \tilde{\textbf{G}}_{j}^{k}\Vert _{\mathcal {E}_j,*}^{2}+L_{r_j}^{2}\right) \right) \right] \\&\quad \le \frac{1}{T_k}{\mathbb {E}}\left[ 2\sum _{t=1}^{T_k} \left( \sum _{j<i}\left( \frac{\gamma _{j}^{k}}{\alpha _{j}}\right) ^{2} \left( \frac{1}{T_k}\sum _{s=1}^{T_k} L_{i}(\xi _{t}^{k})^{2}\Vert \textbf{G}_{j}(\textbf{x}_{<j}^{k+1},\textbf{x}_{\ge j}^{k},\xi _{s}^{k})\Vert _{\mathcal {E}_j,*}^{2}\right) \right) \right] \\&\qquad +\frac{1}{T_k}{\mathbb {E}}\left[ 2\sum _{t=1}^{T_k} \left( \sum _{j<i}L_{i}(\xi _{t}^{k})^{2}L_{r_j}^{2} \left( \frac{\gamma _{j}^{k}}{\alpha _{j}}\right) ^{2}\right) \right] \\&\quad \le 2mL_{i}^{2}\left( \delta (\gamma _{max}^{k}) + L_{r}^{2}\right) \left( \frac{\gamma _{max}^{k}}{\alpha }\right) ^{2}, \end{aligned} \end{aligned}$$
(72)

which, together with Lemma 3, gives that

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left[ \Vert \tilde{\textbf{G}}_{i}^{k}\Vert _{\mathcal {E}_i,*}^{2} \right]&\le 2{\mathbb {E}}\left[ \Vert \tilde{\textbf{G}}_{i}^{k}-\bar{\textbf{G}}_{i}^{k}\Vert _{\mathcal {E}_i,*}^{2} \right] +2{\mathbb {E}}\left[ \Vert \bar{\textbf{G}}_{i}^{k}\Vert _{\mathcal {E}_i,*}^{2} \right] \\&\le 4mL_{i}^{2}\left( \delta (\gamma _{max}^{k}) + L_{r}^{2}\right) \left( \frac{\gamma _{max}^{k}}{\alpha }\right) ^{2} + 2\left( M_{i}^{2} + \frac{2c_{i}^{2}\sigma _{i}^{2}}{T_{k}}\right) \\&= \varepsilon _{i}(\gamma _{max}^{k},T_{k}). \end{aligned} \end{aligned}$$

Furthermore, by the above observation, we conclude that

$$\begin{aligned} {\mathbb {E}}\left[ \Vert \textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}\Vert _{\mathcal {E}_i}^{2} \right] \le 2\left( \varepsilon _{i}(\gamma _{max}^{k},T_{k}) + L_{r_i}^{2}\right) \left( \frac{\gamma _{i}^{k}}{\alpha _{i}}\right) ^{2}, \end{aligned}$$
(73)

completing the proof. \(\square \)
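The intermediate step-length estimate \(\Vert \textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}\Vert _{\mathcal {E}_i} \le \frac{\gamma _{i}^{k}}{\alpha _{i}}( \Vert \tilde{\textbf{G}}_{i}^{k}\Vert _{\mathcal {E}_i,*}+L_{r_i})\) used above can be checked numerically in the Euclidean setting \(\omega _i=\tfrac{1}{2}\Vert \cdot \Vert _2^2\) (so \(\alpha _i=1\)) with \(r_i(\textbf{x})=\lambda \Vert \textbf{x}\Vert _1\), whose Lipschitz constant with respect to \(\Vert \cdot \Vert _2\) is \(\lambda \sqrt{n}\). This is an illustrative sketch under those assumptions only.

import numpy as np

rng = np.random.default_rng(4)
n, lam, gamma = 6, 0.3, 0.2
L_r = lam * np.sqrt(n)                        # Lipschitz constant of r(x) = lam * ||x||_1 w.r.t. ||.||_2
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

for _ in range(1000):
    x = rng.normal(size=n)
    G = rng.normal(size=n)                    # plays the role of the stochastic gradient
    x_new = soft(x - gamma * G, gamma * lam)  # mirror step with omega = 0.5 * ||.||^2
    assert np.linalg.norm(x - x_new) <= gamma * (np.linalg.norm(G) + L_r) + 1e-9
print("step-length bound holds on all samples")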

Proof of Lemma 7

Let \(\textbf{x}_{\textbf{g}_{i}}^{k+1} =\textbf{x}_{i}^{k}-\gamma _{i}^{k}\mathcal {G}_{i}\left( \textbf{x}_{i}^{k},\textbf{g}(\textbf{x}^{k}),\gamma _{i}^{k} \right) =P_{i}\left( \textbf{x}_{i}^{k},\textbf{g}(\textbf{x}^{k}),\gamma _{i}^{k} \right) \). By the optimality condition of (48), and the definition of \(\textbf{x}_{i}^{k+1}\) and \(\textbf{x}_{\textbf{g}_{i}}^{k+1}\), we have

$$\begin{aligned} \left\langle \tilde{\textbf{G}}_{i}^{k}+\frac{1}{\gamma _{i}^{k}}\left( \nabla \omega _{i}(\textbf{x}_{i}^{k+1}) -\nabla \omega _{i}(\textbf{x}_{i}^{k}) \right) ,\textbf{x}_{\textbf{g}_{i}}^{k+1}-\textbf{x}_{i}^{k+1}\right\rangle +r_{i}(\textbf{x}_{\textbf{g}_{i}}^{k+1})-r_{i}(\textbf{x}_{i}^{k+1})\ge 0, \end{aligned}$$

and

$$\begin{aligned} \langle \textbf{g}_{i}(\textbf{x}^{k})+\frac{1}{\gamma _{i}^{k}}\left( \nabla \omega _{i}(\textbf{x}_{\textbf{g}_{i}}^{k+1}) -\nabla \omega _{i}(\textbf{x}_{i}^{k}) \right) ,\textbf{x}_{i}^{k+1}-\textbf{x}_{\textbf{g}_{i}}^{k+1}\rangle +r_{i}(\textbf{x}_{i}^{k+1})-r_{i}(\textbf{x}_{\textbf{g}_{i}}^{k+1})\ge 0, \end{aligned}$$

respectively. Summing up the above two inequalities, one has

$$\begin{aligned} \begin{aligned} \langle \tilde{\textbf{G}}_{i}^{k}-\textbf{g}_{i}(\textbf{x}^{k}),\textbf{x}_{\textbf{g}_{i}}^{k+1}-\textbf{x}_{i}^{k+1}\rangle \ge&\frac{1}{\gamma _{i}^{k}}\langle \nabla \omega _{i}(\textbf{x}_{\textbf{g}_{i}}^{k+1}) -\nabla \omega _{i}(\textbf{x}_{i}^{k+1}),\textbf{x}_{\textbf{g}_{i}}^{k+1}-\textbf{x}_{i}^{k+1}\rangle \\ \ge&\frac{\alpha _{i}}{\gamma _{i}^{k}}\Vert \textbf{x}_{\textbf{g}_{i}}^{k+1}-\textbf{x}_{i}^{k+1}\Vert _{\mathcal {E}_i}^{2}, \end{aligned} \end{aligned}$$

where the second inequality follows from the strong convexity of \(\omega _{i}\). Then, it holds that

$$\begin{aligned} \frac{1}{\gamma _{i}^{k}}\Vert \textbf{x}_{\textbf{g}_{i}}^{k+1}-\textbf{x}_{i}^{k+1}\Vert _{\mathcal {E}_i}\le \frac{1}{\alpha _{i}}\Vert \tilde{\textbf{G}}_{i}^{k}-\textbf{g}_{i}(\textbf{x}^{k})\Vert _{\mathcal {E}_i,*}. \end{aligned}$$

Using the above relation, we obtain

$$\begin{aligned} \begin{aligned} \left\| \mathcal {G}_{i}(\textbf{x}_{i}^{k},\textbf{g}(\textbf{x}^{k}),\gamma _{i}^{k}) -\frac{1}{\gamma _{i}^{k}}(\textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1})\right\| _{\mathcal {E}_i} =&\left\| \frac{1}{\gamma _{i}^{k}}(\textbf{x}_{i}^{k}-\textbf{x}_{\textbf{g}_{i}}^{k+1}) -\frac{1}{\gamma _{i}^{k}}(\textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}) \right\| _{\mathcal {E}_i} \\ \le&\frac{1}{\alpha _{i}}\Vert \tilde{\textbf{G}}_{i}^{k}-\textbf{g}_{i}(\textbf{x}^{k})\Vert _{\mathcal {E}_i,*}. \end{aligned} \end{aligned}$$

Hence, we have

$$\begin{aligned} \begin{aligned} \left\| \mathcal {G}_{i}(\textbf{x}_{i}^{k},\textbf{g}(\textbf{x}^{k}),\gamma _{i}^{k})\right\| _{\mathcal {E}_i}^{2} =&\left\| \mathcal {G}_{i}(\textbf{x}_{i}^{k},\textbf{g}(\textbf{x}^{k}),\gamma _{i}^{k}) -\frac{1}{\gamma _{i}^{k}}(\textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1}) +\frac{1}{\gamma _{i}^{k}}(\textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1})\right\| _{\mathcal {E}_i}^{2} \\ \le&2\left( \frac{1}{\alpha _{i}^{2}}\Vert \tilde{\textbf{G}}_{i}^{k}-\textbf{g}_{i}(\textbf{x}^{k})\Vert _{\mathcal {E}_i,*}^{2} +\left\| \frac{1}{\gamma _{i}^{k}}(\textbf{x}_{i}^{k}-\textbf{x}_{i}^{k+1})\right\| _{\mathcal {E}_i}^{2}\right) , \end{aligned} \end{aligned}$$

and we obtain the result as required in (51). \(\square \)
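In the Euclidean, single-block setting, \(P_{i}\) reduces to the proximal operator of \(\gamma _{i}^{k} r_{i}\) and \(\alpha _{i}=1\), so the argument above says that the gradient mapping computed from the stochastic estimate deviates from the exact one by at most \(\Vert \tilde{\textbf{G}}_{i}^{k}-\textbf{g}_{i}(\textbf{x}^{k})\Vert \). The snippet below is an illustrative numerical check of this perturbation bound with \(r_{i}=\lambda \Vert \cdot \Vert _1\); it is a sketch under these simplifying assumptions, not the authors' implementation.

import numpy as np

rng = np.random.default_rng(3)
n, lam, gamma = 6, 0.3, 0.1
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)   # prox of t * ||.||_1

for _ in range(1000):
    x = rng.normal(size=n)
    g = rng.normal(size=n)                     # exact gradient g_i(x^k)
    G = g + rng.normal(size=n)                 # stochastic estimate of the gradient
    x_g = soft(x - gamma * g, gamma * lam)     # exact prox point P_i(x_i^k, g, gamma)
    x_new = soft(x - gamma * G, gamma * lam)   # update computed from the noisy gradient
    lhs = np.linalg.norm((x - x_g) / gamma - (x - x_new) / gamma)
    assert lhs <= np.linalg.norm(G - g) + 1e-9  # (1/alpha) * ||G - g|| with alpha = 1
print("gradient-mapping perturbation bound holds on all samples")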


Cite this article

Yang, J., Song, H., Li, X. et al. Block Mirror Stochastic Gradient Method For Stochastic Optimization. J Sci Comput 94, 69 (2023). https://doi.org/10.1007/s10915-023-02110-y

