Skip to main content
Log in

A Stochastic Subgradient Method for Distributionally Robust Non-convex and Non-smooth Learning

  • Published:
Journal of Optimization Theory and Applications Aims and scope Submit manuscript

Abstract

We consider a distributionally robust formulation of stochastic optimization problems arising in statistical learning, where robustness is with respect to ambiguity in the underlying data distribution. Our formulation builds on risk-averse optimization techniques and the theory of coherent risk measures. It uses mean–semideviation risk for quantifying uncertainty, allowing us to compute solutions that are robust against perturbations in the population data distribution. We consider a broad class of generalized differentiable loss functions that can be non-convex and non-smooth, involving upward and downward cusps, and we develop an efficient stochastic subgradient method for distributionally robust problems with such functions. We prove that it converges to a point satisfying the optimality conditions. To our knowledge, this is the first method with rigorous convergence guarantees in the context of generalized differentiable non-convex and non-smooth distributionally robust stochastic optimization. Our method allows for the control of the desired level of robustness with little extra computational cost compared to population risk minimization with stochastic gradient methods. We also illustrate the performance of our algorithm on real datasets arising in convex and non-convex supervised learning problems.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3

Similar content being viewed by others

Notes

  1. From the update rule of \(y^k\), it follows that the variable \(y^k\) is the projection of \(x^k-z^k/c\) onto the constraint set X, where 1/c can be interpreted as the stepsize. This projection step ensures that the iterates \(y^k\) lie in the constraint set X.

  2. This statement follows from the following argument: If \(x^*\in X^*\), then by definition (17), there exists \(z^*\in \hat{\partial } F(x^*)\) such that \(-z^*\in N_X(x^*)\), which is equivalent to \(\langle z^*, y-x^*\rangle \ge 0\) for every \(y \in X\). This, together with the definition (18) of the gap function, implies that \(\eta (x^*,z^*)\ge 0\), which yields \(\eta (x^*,z^*) = 0\), due to (20). The other direction can be proved in a similar way. If \(z^*\in \hat{\partial } F(x^*)\) exists such that \(\eta (x^*,z^*)=0\), then by definition (18), \(\langle z^*, y-x^*\rangle \ge 0\) for every \(y \in X\); otherwise, one gets a contradiction. The latter statement is equivalent to \(-z^* \in N_X(x^*)\), and consequently, we obtain \(x^*\in X\).

  3. Notice that in the definition of \(\ell _{2c}\), we have necessarily \(p_2 - \varkappa p_1 - g_{au}>0\) as \(p_2>0\) and \(-\varkappa p_1 - g_{au} \ge 0\) by (25). Similarly, \(p_2 - \varkappa p_1 - g_{bu}>0\). Therefore, the denominator \(p_2 - \varkappa p_1 - sg_{au} - (1-s)g_{bu}>0\).

  4. There are also adversarial learning methods [25, 31, 35, 64] where the aim is to be resistant to norm-bounded perturbations of the input before we have access to it; however, we do not compare with these methods as our formulation (4) focuses on a distributional distortion.

References

  1. Allen-Zhu, Z., Elad, H.: Variance reduction for faster non-convex optimization. In: Maria Florina, B., Weinberger, K.Q. (eds.) Proceedings of The 33rd International Conference on Machine Learning, vol. 48 of Proceedings of Machine Learning Research, pp. 699–707. New York, New York, USA, 20–22 Jun 2016. PMLR

  2. Artzner, P., Delbaen, F., Eber, J.-M., Heath, D.: Coherent measures of risk. Math. Finance 9, 203–228 (1999)

    Article  MathSciNet  Google Scholar 

  3. Baker, J.W., Schubert, M., Faber, M.H.: On the assessment of robustness. Struct. Safety 30(3), 253–267 (2008)

    Article  Google Scholar 

  4. Bonnans, J.F., Alexander, S.: Perturbation Analysis of Optimization Problems. Springer (2013)

  5. Brézis, H.: Monotonicity methods in Hilbert spaces and some applications to nonlinear partial differential equations. In: Contributions to Nonlinear Functional Analysis, pp. 101–156. Elsevier (1971)

  6. Bubeck, S.: Convex optimization: Algorithms and complexity. Found. Trends \({\mathring{R}}\) Mach. Learn. 8(3–4), 231–357 (2015)

  7. Clarke, F.H.: Generalized gradients and applications. Trans. Am. Math. Soc. 205, 247–262 (1975)

    Article  MathSciNet  Google Scholar 

  8. Daszykowski, M., Kaczmarek, K., Vander Heyden, Y., Walczak, B.: Robust statistics in data analysis—a review: basic concepts. Chemometr. Intell. Lab. Syst. 85(2), 203–219 (2007)

  9. Davis, D., Drusvyatskiy, D.: Stochastic model-based minimization of weakly convex functions. SIAM J. Optim. 29(1), 207–239 (2019)

    Article  MathSciNet  Google Scholar 

  10. Dentcheva, D., Penev, S., Ruszczyński, A.: Statistical estimation of composite risk functionals and risk optimization problems. Ann. Inst. Stat. Math. 69(4), 737–760 (2017)

    Article  MathSciNet  Google Scholar 

  11. Drusvyatskiy, D., Ioffe, A.D., Lewis, A.S.: Curves of descent. SIAM J. Control Optim. 53(1), 114–138 (2015)

    Article  MathSciNet  Google Scholar 

  12. Dheeru, D., Casey, G.: UCI Machine Learning Repository (2017) https://archive.ics.uci.edu/ml/index.php

  13. Duchi, J.C., Ruan, F.: Stochastic methods for composite and weakly convex optimization problems. SIAM J. Optim. 28(4), 3229–3259 (2018)

    Article  MathSciNet  Google Scholar 

  14. Duchi, J.C., Namkoong, H.: Learning models with uniform performance via distributionally robust optimization. Ann. Stat. 49(3), 1378–1406 (2021)

  15. Ermoliev, Y.M.: Methods of Stochastic Programming. Nauka, Moscow (1976)

    Google Scholar 

  16. Ermoliev, Y.M., Norkin, V.I.: Sample average approximation method for compound stochastic optimization problems. SIAM J. Optim. 23(4), 2231–2263 (2013)

    Article  MathSciNet  Google Scholar 

  17. Esfahani, P.M., Kuhn, D.: Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Math. Program. 171(1–2), 115–166 (2018)

  18. Föllmer, H., Schied, A.: Stochastic Finance: An Introduction in Discrete Time. Walter de Gruyter (2011)

  19. Foster, D.J., Sekhari, A., Sridharan, K.: Uniform convergence of gradients for non-convex learning and optimization. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31, pp. 8745–8756. Curran Associates, Inc. (2018)

  20. Gao, R., Chen, X., Kleywegt, A.J.: Wasserstein distributional robustness and regularization in statistical learning (2017). arXiv preprint arXiv:1712.06050

  21. Gao, R., Kleywegt, A.J.: Distributionally robust stochastic optimization with Wasserstein distance (2016). arXiv preprint arXiv:1604.02199. https://arxiv.org/pdf/1712.06050.pdf

  22. Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, ii: shrinking procedures and optimal algorithms. SIAM J. Optim. 23(4), 2061–2089 (2013)

    Article  MathSciNet  Google Scholar 

  23. Ghadimi, S., Ruszczynski, A., Wang, M.: A single timescale stochastic approximation method for nested stochastic optimization. SIAM J. Optim. 30(1), 960–979 (2020)

    Article  MathSciNet  Google Scholar 

  24. Goodfellow, I., Yoshua, B., Aaron, C.: Deep Learning. MIT Press (2016)

  25. Goodfellow, I.J, Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples (2014). arXiv preprint arXiv:1412.6572

  26. Hastie, T., Tibshirani, R., Wainwright, M.: The Lasso and Generalizations. CRC Press, Statistical learning with sparsity (2015)

  27. Jain, P., Kakade, S.M., Kidambi, R., Netrapalli, P., Sidford, A.: Accelerating stochastic gradient descent for least squares regression. In: Sébastien, B., Vianney, P., Philippe, R. (eds.) Proceedings of the 31st Conference On Learning Theory, vol. 75 of Proceedings of Machine Learning Research, pp. 545–604 (2018) (PMLR, 06–09 Jul 2018)

  28. Kalogerias, D.S., Powell, W.B.: Recursive optimization of convex risk measures: mean-semideviation models (2018). arXiv preprint arXiv:1804.00636

  29. Krizhevsky, A.: Learning multiple layers of features from tiny images. Technical report, University of Toronto (2009)

  30. Kuhn, D., Peyman Mohajerin, E., Viet Anh, N., Soroosh, S.-A.: Wasserstein distributionally robust optimization: theory and applications in machine learning. In: Operations Research & Management Science in the Age of Analytics, pp. 130–166. INFORMS (2019)

  31. Kurakin, A., Ian, G., Samy, B.: Adversarial machine learning at scale (2016). arXiv preprint arXiv:1611.01236

  32. Kushner, H., Yin, G.G.: Stochastic Approximation Algorithms and Applications. Springer, New York (2003)

    MATH  Google Scholar 

  33. LeCun, Y.L., Corinna, C., Burges, C.J.: MNIST handwritten digit database. ATT Labs 2 (2010). http://yann.lecun.com/exdb/mnist

  34. Li, X., Zhihui, Z., Anthony, M.-C.S., Lee, J.D.: Incremental Methods for Weakly Convex Optimization (2019). arXiv e-prints arXiv:1907.11687

  35. Madry, A., Aleksandar, M., Ludwig, S., Dimitris, T., Adrian, V.: Towards deep learning models resistant to adversarial attacks (2017). arXiv preprint arXiv:1706.06083

  36. Majewski, S., Miasojedow, B., Moulines, E.: Analysis of nonsmooth stochastic approximation: the differential inclusion approach (20118). arXiv preprint arXiv:1805.01916

  37. Mehrotra, S., Zhang, H.: Models and algorithms for distributionally robust least squares problems. Math. Program. 146(1), 123–141 (2014)

    Article  MathSciNet  Google Scholar 

  38. Mei, S., Yu, B., Andrea, M.: The landscape of empirical risk for nonconvex losses. Ann. Stat. 46(6A), 2747–2774 (2018)

  39. Mifflin, R.: Semismooth and semiconvex functions in constrained optimization. SIAM J. Control Optim. 15(6), 959–972 (1977)

    Article  MathSciNet  Google Scholar 

  40. Mikhalevich, V.S., Gupal, A.M., Norkin, V.I.: Nonconvex Optimization Methods. Nauka, Moscow (1987)

    MATH  Google Scholar 

  41. Namkoong, H., Duchi, J.C: Stochastic gradient methods for distributionally robust optimization with f-divergences. In: Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29. Curran Associates, Inc. (2016)

  42. Norkin, V.I.: Generalized-differentiable functions. Cybern. Syst. Anal. 16(1), 10–12 (1980)

    Article  MathSciNet  Google Scholar 

  43. Ogryczak, W., Ruszczyński, A.: From stochastic dominance to mean-risk models: semideviations as risk measures. Eur. J. Oper. Res. 116, 33–50 (1999)

    Article  Google Scholar 

  44. Ogryczak, W., Ruszczyński, A.: On consistency of stochastic dominance and mean-semideviation models. Math. Program. 89, 217–232 (2001)

    Article  MathSciNet  Google Scholar 

  45. Postek, K., den Hertog, D., Melenberg, B.: Computationally tractable counterparts of distributionally robust constraints on risk measures. SIAM Rev. 58(4), 603–650 (2016)

    Article  MathSciNet  Google Scholar 

  46. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 400–407 (1951)

  47. Ruszczyński, A.: A linearization method for nonsmooth stochastic programming problems. Math. Oper. Res. 12(1), 32–49 (1987)

    Article  MathSciNet  Google Scholar 

  48. Ruszczyński, A., Shapiro, A.: Optimization of convex risk functions. Math. Oper. Res. 31, 433–452 (2006)

    Article  MathSciNet  Google Scholar 

  49. Ruszczyński, A.: Convergence of a stochastic subgradient method with averaging for nonsmooth nonconvex constrained optimization. Optim. Lett. 14, 1615–1625 (2020)

    Article  MathSciNet  Google Scholar 

  50. Ruszczynski, A.: A stochastic subgradient method for nonsmooth nonconvex multilevel composition optimization. SIAM J. Control Optim. 59(3), 2301–2320 (2021)

    Article  MathSciNet  Google Scholar 

  51. Seidman, J.H., Fazlyab, M., Preciado, V.M., Pappas, G.J.: Robust deep learning as optimal control: Insights and convergence guarantees. In: Bayen, A.M., Jadbabaie, A., Pappas, G., Parrilo, P.A., Benjamin, R., Claire, T., Melanie, Z. (eds.) Proceedings of the 2nd Conference on Learning for Dynamics and Control, vol. 120 of Proceedings of Machine Learning Research, pp. 884–893. PMLR, 10–11 (2020)

  52. Soroosh, S.-A., Peyman, M., Esfahani, D.K.: Distributionally robust logistic regression. In: Proceedings of the 28th International Conference on Neural Information Processing Systems—vol. 1. NIPS’15, pp. 1576–1584. Cambridge, MA, USA, 2015. MIT Press (2015)

  53. Shai, S.-S., Shai, B.-D.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press (2014)

  54. Shapiro, A., Dentcheva, D., Ruszczyński, A.: Lectures on Stochastic Programming: Modeling and Theory. SIAM, Philadelphia (2009)

    Book  Google Scholar 

  55. Sinha, A., Hongseok, N., John, D.: Certifying some distributional robustness with principled adversarial training. In: International Conference on Learning Representations (2018). https://openreview.net/forum?id=Hk6kPgZA-

  56. Soma, T., Yuichi, Y.: Statistical learning with conditional value at risk (2020). arXiv preprint arXiv:2002.05826

  57. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(56), 1929–1958 (2014)

    MathSciNet  MATH  Google Scholar 

  58. Takeda, A., Kanamori, T.: A robust approach based on conditional value-at-risk measure to statistical learning problems. Eur. J. Oper. Res. 198(1), 287–296 (2009)

    Article  MathSciNet  Google Scholar 

  59. Teo, C.H., Vishwanthan, S.V.N., Smola, Alex J., Le, Quoc V.: Bundle methods for regularized risk minimization. J. Mach. Learn. Res. 11(10), 311–365 (2010)

  60. Vladimir, V.: The Nature of Statistical Learning Theory. Springer Science & Business Media (2013)

  61. Wang, M., Fang, E.X., Liu, B.: Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Math. Program. 161(1–2), 419–449 (2017)

    Article  MathSciNet  Google Scholar 

  62. Wang, M., Liu, J., Fang, E.X.: Accelerating stochastic composition optimization. J. Mach. Learn. Res. 18, 1–23 (2017)

    MathSciNet  MATH  Google Scholar 

  63. Yang, S., Wang, M., Fang, E.X.: Multilevel stochastic gradient methods for nested composition optimization. SIAM J. Optim. 29(1), 616–659 (2019)

  64. Zhang, D., Tianyuan, Z., Yiping, L., Zhanxing, Z., Bin, D.: You only propagate once: Accelerating adversarial training via maximal principle. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc. (2019)

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Mert Gürbüzbalaban.

Additional information

Communicated by Zaid Harchaoui.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was partially supported by the National Science Foundation Awards DMS-1907522, CCF-1814888, and DMS-2053485, and by the Office of Naval Research Awards N00014-21-1-2161 and N00014-21-1-2244.

Appendices

Appendix A: Generalized Differentiability of Functions

Norkin [42] introduced the following class of functions.

Definition A.1

A function \(f:\mathbbm {R}^n\rightarrow \mathbbm {R}\) is differentiable in a generalized sense at a point \(x\in \mathbbm {R}^n\), if an open set \(U\subset \mathbbm {R}^n\) containing x, and a non-empty, convex, compact valued, and upper semicontinuous multifunction \(\hat{\partial } f: U \rightrightarrows \mathbbm {R}^n\) exist, such that for all \(y\in U\) and all \(g \in \hat{\partial } f(y)\) the following equation is true:

$$\begin{aligned} f(y) = f(x) + \langle g(y), y-x \rangle + o(x,y,g), \end{aligned}$$

with

$$\begin{aligned} \lim _{y\rightarrow x} \sup _{g\in G(y)} \frac{o(x,y,g)}{\Vert y-x\Vert }=0. \end{aligned}$$

The set \(\hat{\partial } f(y)\) is the generalized subdifferential of f at y. If a function is differentiable in a generalized sense at every \(x \in \mathbbm {R}^n\) with the same generalized subdifferential mapping \(\hat{\partial } f:\mathbbm {R}^n\rightrightarrows \mathbbm {R}^n\), we call it differentiable in a generalized sense.

A function \(f:\mathbbm {R}^n\rightarrow \mathbbm {R}^m\) is differentiable in a generalized sense, if each of its component functions, \(f_i:\mathbbm {R}^n\rightarrow \mathbbm {R}\), \(i=1,\dots ,m\), has this property.

The class of such functions is contained in the set of locally Lipschitz functions and contains all subdifferentially regular functions [7], Whitney stratifiable Lipschitz functions [11], semismooth functions [39], and their compositions. The Clarke subdifferential \(\partial \! f(x)\) is an inclusion-minimal generalized subdifferential, but the generalized sub-differential mapping \(\hat{\partial } f(\cdot )\) is not uniquely defined in Definition A.1. However, if \(f:\mathbbm {R}^n\rightarrow \mathbbm {R}\) is differentiable in a generalized sense, then for almost all \(x\in \mathbbm {R}^n\) we have \(\hat{\partial } f(x)=\{\nabla f(x)\}\).

Compositions of generalized differentiable functions are crucial in our analysis.

Theorem A.1

[40, Thm. 1.6] If \(h:\mathbbm {R}^m \rightarrow \mathbbm {R}\) and \(f_i:\mathbbm {R}^n\rightarrow \mathbbm {R}\), \(i=1,\dots ,m\), are differentiable in a generalized sense, then the composition \(\psi (x) = h\big ( f_1(x),\dots ,f_m(x)\big )\) is differentiable in a generalized sense, and at any point \(x\in \mathbbm {R}^n\) we can define the generalized subdifferential of \(\psi \) as follows:

$$\begin{aligned} \hat{\partial } \psi (x) = \text {conv} \big \{ g\in \mathbbm {R}^n: g = \begin{bmatrix} g_1&\cdots&g_m\end{bmatrix} g_0,\\ \text { with } g_0\in \hat{\partial }{h}\big (f_1(x),\dots ,f_m(x)\big ) \text { and } g_j\in \hat{\partial }{f_j}(x),\ j=1,\dots ,m\big \}. \end{aligned}$$

Even if we take \(\hat{\partial }{h}(\cdot )=\partial h(\cdot )\) and \(\hat{\partial }{f_j}(\cdot )=\partial \! f_j(\cdot )\), \(j=1,\dots ,m\), we may obtain \(\hat{\partial }\psi (\cdot ) \ne \partial \psi (\cdot )\), but \(\hat{\partial }\psi \) defined above satisfies Definition A.1.

For stochastic optimization, essential is the closure of the class functions differentiable in a generalized sense with respect to expectation.

Theorem A.2

[40, Thm. 23.1] Suppose \((\varOmega ,\mathcal {F},P)\) is a probability space and a function \(f:\mathbbm {R}^n\times \varOmega \rightarrow \mathbbm {R}\) is differentiable in a generalized sense with respect to x for all \(\omega \in \varOmega \) and integrable with respect to \(\omega \) for all \(x\in \mathbbm {R}^n\). Let \(\hat{\partial } f: \mathbbm {R}^n \times \varOmega \rightrightarrows \mathbbm {R}^n\) be a multifunction, which is measurable with respect to \(\omega \) for all \(x\in \mathbbm {R}^n\), and which is a generalized subdifferential mapping of \(f(\cdot ,\omega )\) for all \(\omega \in \varOmega \). If for every compact set \(K\subset \mathbbm {R}^n\) an integrable function \(L_K:\varOmega \rightarrow \mathbbm {R}\) exists, such that \(\sup _{x\in K}\sup _{g\in \hat{\partial } f(x,\omega )}\Vert g\Vert \le L_K(\omega )\), \(\omega \in \varOmega \), then the function

$$\begin{aligned} F(x) = \int _\varOmega f(x,\omega )\;P(d\omega ),\quad x\in \mathbbm {R}^n, \end{aligned}$$

is differentiable in a generalized sense, and the multifunction

$$\begin{aligned} \hat{\partial } F(x) = \int _\varOmega \hat{\partial } f(x,\omega )\;P(d\omega ),\quad x\in \mathbbm {R}^n, \end{aligned}$$

is its generalized subdifferential mapping.

A key step in the analysis of stochastic recursive algorithms by the differential inclusion method is the chain rule on a path (see [9] and the references therein). For an absolutely continuous function \(p:[0,\infty )\rightarrow \mathbbm {R}^n\), we denote by \(\overset{{{\;\,}_\bullet }}{p}(\cdot )\) its weak derivative: a measurable function such that

$$\begin{aligned} p(t) = p(0) + \int _0^t \overset{{{\;\,}_\bullet }}{p}(s)\;{\text {d}}s,\quad \forall \; t \ge 0. \end{aligned}$$

Theorem A.3

[49, Thm. 1] If a function \(f:\mathbbm {R}^n \rightarrow \mathbbm {R}^m\) and a path \(p:[0,\infty )\rightarrow \mathbbm {R}^n\) are differentiable in a generalized sense, then

$$\begin{aligned} f(p(T))- f(p(0)) = \int _0^T g(p(t)) \, \overset{{{\;\,}_\bullet }}{p}(t) \;{\text {d}}t, \end{aligned}$$
(42)

for all selections \(g(\cdot ) \in \hat{\partial } f(\cdot )\), and all \(T>0\).

Appendix B: Proof of Lemma 3.3

Proof

Formula (13) and assumptions (A4)(ii) and (iii) yield:

$$\begin{aligned} u^{k+1} = u^k + \tau _k \big [J^{k+1} \big (\bar{y}(x^k,z^k)-x^k\big ) + b \big ( h(x^{k+1})- u^k\big )\big ] + \tau _k \theta _u^{k+1} + \tau _k \epsilon _u^{k+1}, \end{aligned}$$
(43)

with the errors

$$\begin{aligned} \theta _u^{k+1} = E^{k+1}\big (\bar{y}(x^k,z^k)-x^k\big ) + b e_h^{k+1},\\ \epsilon _u^{k+1} = \Delta ^{k+1} \big (\bar{y}(x^k,z^k)-x^k\big ) + b \delta _h^{k+1}. \end{aligned}$$

Due to assumption (A4), for some constant \(C_u^{\theta }\),

$$\begin{aligned} \mathbbm {E}\big [\theta _u^{k+1}\,\big |\, \mathcal {F}_k\big ]=0, \quad \mathbbm {E}\big [\Vert \theta _u^{k+1}\Vert ^2\,\big |\,\mathcal {F}_k\big ]\le C_u^{\theta }, \quad k=0,1,\dots \end{aligned}$$
(44)

and

$$\begin{aligned} \lim _{k\rightarrow \infty } \epsilon _u^{k+1} = 0 \quad \text {a.s.}. \end{aligned}$$

To verify the boundedness of \(\{u^k\}\), we define the quantities

$$\begin{aligned} \tilde{u}^k = u^k + \sum _{j=k}^\infty \tau _j \theta _u^{j+1}. \end{aligned}$$

Owing to (A3) and (44), by virtue of the martingale convergence theorem, the series in the formula above is convergent a.s., and thus, \(\tilde{u}^k - u^k \rightarrow 0\) a.s., when \(k\rightarrow \infty \). We can now use (43) to establish the following recursive relation:

$$\begin{aligned} \begin{aligned} \tilde{u}^{k+1}&= (1-b\tau _k) \tilde{u}^k + b\tau _k\Big [ \frac{1}{b}J^{k+1} \big (\bar{y}(x^k,z^k)-x^k\big ) + h(x^{k+1})+ \frac{1}{b} \epsilon _u^{k+1} + (\tilde{u}^k -u^k)\Big ]. \end{aligned} \end{aligned}$$

By (A1), the sequences \(\{J^k\}\) and \(\{h(x^k)\}\) are bounded. Since \(\tilde{u}^k - u^k \rightarrow 0\) and \(\epsilon _u^{k}\rightarrow 0\) a.s., the elements in the brackets in the formula above constitute an almost surely bounded sequence. Consequently, the sequence \(\{\tilde{u}^k\}\) of their convex combinations is almost surely bounded as well. The same is true for the sequence \(\{{u}^k\}\), because \(\tilde{u}^k - u^k \rightarrow 0\) a.s.

The boundedness of \(\{z^k\}\) can be established in a similar way. We rewrite (12) as

$$\begin{aligned} z^{k+1} = z^k + a\tau _k \Big ({g}_x^{k+1}+\big [{J}^{\,k+1}\big ]^\top {g}_u^{k+1} - z^k\Big ) + a\tau _k \theta _z^{k+1} + a\tau _k \epsilon _z^{k+1}, \end{aligned}$$
(45)

with the errors

$$\begin{aligned} \theta _z^{k+1} = e_{gx}^{k+1} + \big [{J}^{\,k+1}\big ]^\top e_{gu}^{k+1} + \big [{E}^{\,k+1}\big ]^\top {g}_u^{k+1} +\big [{E}^{\,k+1}\big ]^\top e_{gu}^{k+1},\\ \epsilon _z^{k+1} = \delta _{gx}^{k+1} + \big [\tilde{J}^{\,k+1}\big ]^\top \delta _{gu}^{k+1} + \big [\Delta ^{\,k+1}\big ]^\top \tilde{g}_u^{k+1}. \end{aligned}$$

Due to assumption (A4) (note the statistical independence of \({E}^{\,k+1}\) and \( e_{gu}^{k+1}\)), for some constant \(C_z^{\theta }\),

$$\begin{aligned} \mathbbm {E}\big [\theta _z^{k+1}\,\big |\, \mathcal {F}_k\big ]=0, \quad \mathbbm {E}\big [\Vert \theta _z^{k+1}\Vert ^2\,\big |\,\mathcal {F}_k\big ]\le C_z^{\theta }, \quad k=0,1,\dots \end{aligned}$$

and

$$\begin{aligned} \lim _{k\rightarrow \infty } \epsilon _z^{k+1} = 0 \quad \text {a.s.}. \end{aligned}$$

The remaining proof is the same as that for \(\{u^k\}\), with relation (45) replacing (43). \(\square \)

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Gürbüzbalaban, M., Ruszczyński, A. & Zhu, L. A Stochastic Subgradient Method for Distributionally Robust Non-convex and Non-smooth Learning. J Optim Theory Appl 194, 1014–1041 (2022). https://doi.org/10.1007/s10957-022-02063-6

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10957-022-02063-6

Keywords

Mathematics Subject Classification

Navigation