
Efficient zeroth-order proximal stochastic method for nonconvex nonsmooth black-box problems

Published in: Machine Learning

Abstract

The proximal gradient method plays a major role in solving nonsmooth composite optimization problems. However, in machine learning problems involving black-box models, the proximal gradient method cannot be applied because explicit gradients are difficult or impossible to derive. Several zeroth-order (ZO) stochastic variance-reduced algorithms, such as ZO-SVRG and ZO-SPIDER, have recently been studied for nonconvex optimization problems. However, almost all existing ZO-type algorithms suffer from a slowdown, with function query complexities increased by a small-degree polynomial of the problem size. To fill this void, we propose a new analysis of proximal stochastic gradient algorithms for nonconvex, nonsmooth finite-sum problems, called ZO-PSVRG+ and ZO-PSPIDER+. The main goal of this work is a unified convergence analysis for ZO-PSVRG+ and ZO-PSPIDER+ that recovers several existing convergence results for arbitrary minibatch sizes while improving the complexity of their ZO-oracle and proximal-oracle calls. We prove that, under the Polyak-Łojasiewicz (PL) condition, the studied ZO algorithms, in contrast to existing ZO-type methods, achieve global linear convergence for a wide range of minibatch sizes once the iterates enter a local PL region, without restarts or algorithmic modifications. Existing analyses in the literature are mainly limited to large minibatch sizes, which renders the corresponding methods impractical for real-world problems with limited computational capacity. In empirical experiments on black-box models, we show that the new analysis yields superior performance and faster convergence to a solution of nonconvex nonsmooth problems than existing ZO-type methods, which are hampered by small stepsizes. As a byproduct, the proposed analysis is generic and can be applied to other variants of gradient-free variance-reduction methods to make them more efficient.
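To make the setting concrete, here is a minimal, hypothetical Python sketch (not the authors' ZO-PSVRG+ or ZO-PSPIDER+ implementation) of the two building blocks the abstract refers to: a two-point random-direction zeroth-order estimate of the gradient of the smooth part, and a proximal (soft-thresholding) step for an ℓ1 regularizer. All function names, the Gaussian smoothing estimator, the fixed step size, and the plain ZO proximal loop are illustrative assumptions; the paper's algorithms add SVRG/SPIDER-style variance reduction and minibatching on top of these pieces.

```python
# Hypothetical sketch of zeroth-order proximal optimization for
#   min_x  f(x) + lambda * ||x||_1,
# where f can only be queried as a black box (no explicit gradients).
import numpy as np

def zo_gradient(f, x, mu=1e-3, num_dirs=10, rng=None):
    """Two-point random-direction (Gaussian smoothing) estimate of grad f(x)."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    g = np.zeros(d)
    for _ in range(num_dirs):
        u = rng.standard_normal(d)
        g += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return g / num_dirs

def prox_l1(v, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def zo_proximal_descent(f, x0, lam=0.01, eta=0.1, iters=200, seed=0):
    """Plain ZO proximal loop; variance-reduced estimators (SVRG/SPIDER style)
    would replace the raw zo_gradient call below."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(iters):
        g = zo_gradient(f, x, rng=rng)        # gradient-free estimate of the smooth part
        x = prox_l1(x - eta * g, eta * lam)   # proximal (soft-thresholding) step
    return x

if __name__ == "__main__":
    # Toy black-box loss: least squares queried only through function values.
    rng = np.random.default_rng(1)
    A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)
    f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
    x_hat = zo_proximal_descent(f, np.zeros(20))
    print("objective:", f(x_hat), "nonzeros:", int(np.count_nonzero(x_hat)))
```

Each iteration of this sketch costs 2 * num_dirs function queries; the complexity improvements discussed in the abstract concern how variance reduction and the choice of minibatch size reduce this query count and the number of proximal-oracle calls.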


Data availability

Not applicable.

Code availability

Available.

Notes

  1. https://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/binary.html.


Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments that improved the manuscript significantly.

Funding

Not applicable.

Author information

Authors and Affiliations

Authors

Contributions

EK and LW contributed to the theoretical results and analysis; EK performed the empirical studies; EK and LW wrote the paper. All authors contributed to the revised manuscript.

Corresponding author

Correspondence to Ehsan Kazemi.

Ethics declarations

Conflict of interest

Not applicable.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Editor: Lam M. Nguyen.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 867 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Kazemi, E., Wang, L. Efficient zeroth-order proximal stochastic method for nonconvex nonsmooth black-box problems. Mach Learn 113, 97–120 (2024). https://doi.org/10.1007/s10994-023-06409-7


Keywords
