Adaptive (sub)gradient methods are widely used, for example in the training of deep networks, and they achieve square-root regret bounds in convex settings. However, how to exploit strong convexity to improve both the convergence rate and the generalization performance of adaptive (sub)gradient methods remains an open problem. For this reason, we devise an adaptive subgradient online learning algorithm for strongly convex settings, called SAdaBoundNc. Moreover, we rigorously prove that a logarithmic regret bound can be achieved by choosing a faster diminishing learning rate. Further, we conduct experiments on four real-world datasets to evaluate the performance of SAdaBoundNc. The results demonstrate that SAdaBoundNc trains faster than stochastic gradient descent and several adaptive gradient methods in the initial training phase. Moreover, the generalization performance of SAdaBoundNc is also better than that of the current state-of-the-art methods on different datasets.

Appendix A
Proof of Theorem 1
Following from Eqs. (1) and (8), we have
Moreover, \(P_{{\mathcal {K}},\Delta _k}\left( w^{*}\right) =w^{*}\) for all \(w^{*}\in {\mathcal {K}}\). Using the non-expansiveness property of the weighted projection [28], we have
Rearranging Eq. (12), we obtain
Further, by using the Cauchy–Schwarz and Young inequalities, the term \(\frac{\lambda _{1k}}{1-\lambda _{1k}}\left\langle m_{k-1},w^{*}-w_k\right\rangle \) in Eq. (13) can be upper bounded as follows:
When \(k=1\), \(\frac{\lambda _{1k}}{1-\lambda _{1k}}\left\langle m_{k-1},w^{*}-w_k\right\rangle =0\) due to \(m_0=0\); when \(k\ge 2\), we have
Plugging Eq. (14) into Eq. (13), we get
We now derive the bound of the regret of SAdaBoundNc. By utilizing the strong convexity of \(f_k\) for all \(k\in [K]\), we have
Plugging Eq. (15) into Eq. (16), we obtain
To obtain the upper bound of Eq. (17), we first derive the following relation:
where the last inequality is obtained due to \(\lambda _{11}=\lambda _1\) and \(\lambda _{1k}\le \lambda _{1(k-1)}\) for all \(k\in [K]\). By using Eq. (18), we further get
To bound Eq. (19), \(T_{11}\) and \(T_{12}\) need to be bounded. Because Condition 1 holds, we have
where the last inequality follows from Assumption 1. For the term \(T_{12}\), we derive the following relation:
where we have utilized Eq. (7) to derive the first inequality; the second inequality is obtained by utilizing Eq. (9); the last inequality follows from \(\lambda _{1k}\le \lambda _1\) for all \(k\in [K]\). Thus, we immediately get
Plugging Eqs. (20) and (22) into (19), we have
We now bound the term \(T_2\) in Eq. (17). The definition of \(\delta _k\) implies that
Thus, we obtain
where the first inequality holds by \(\lambda _{1k}\le \lambda _1<1\), and the last inequality follows from Eq. (23). To bound \(T_2\), we need to upper bound the term \(\sum \nolimits _{k=1}^K\left\| m_k\right\| ^2/k\) in Eq. (24). Hence, by using the definition of \(m_k\) in Eq. (5), we have
To further bound Eq. (26), we now start to bound the term \(T_{21}\) by using the Cauchy–Schwarz inequality as follows:
where the first inequality holds by utilizing \(\lambda _{1k}\le \lambda _1\) for all \(k\in [K]\); the last inequality is derived from the inequality \(\sum \nolimits _{j=1}^K\lambda _1^{K-j}\le 1/(1-\lambda _1)\). Plugging Eq. (27) into Eq. (26), we get
where the third inequality is due to Assumption 2; the fifth inequality is obtained by the inequality \(\sum \nolimits _{j=k}^K\lambda _1^{j-k} \le 1/(1-\lambda _1)\); the sixth inequality is derived by using the following inequality:
Hence, plugging Eq. (28) into Eq. (25), we have
Finally, we bound the term \(T_3\) as follows:
where the last inequality follows from Assumption 1 and \(\lambda _{1k}\le \lambda _1\) for all \(k\in [K]\). Plugging Eqs. (23), (30), and (31) into Eq. (17) completes the proof of Theorem 1. \(\Box \)
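Remark (a sketch for the reader, not part of the original proof). The two geometric-series bounds invoked above follow from the closed form of a finite geometric sum, assuming only \(0\le \lambda _1<1\). The summation index \(i\) below is introduced purely for illustration:

```latex
\begin{align*}
\sum_{j=1}^{K}\lambda_1^{K-j}
  &= \sum_{i=0}^{K-1}\lambda_1^{i}
   = \frac{1-\lambda_1^{K}}{1-\lambda_1}
   \le \frac{1}{1-\lambda_1}, \\
\sum_{j=k}^{K}\lambda_1^{j-k}
  &= \sum_{i=0}^{K-k}\lambda_1^{i}
   = \frac{1-\lambda_1^{K-k+1}}{1-\lambda_1}
   \le \frac{1}{1-\lambda_1}.
\end{align*}
```

Neither bound depends on \(K\) or \(k\), so the same constant \(1/(1-\lambda _1)\) can be applied for every \(k\in [K]\).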
Appendix B
Proof of Corollary 1
By the definitions of bounded functions \({\underline{\alpha }}(k)\) and \({\overline{\alpha }}(k)\), we get
where we have used the relation \(1/{\overline{\alpha }}(k)\le 1/\alpha ^{\star }\) for all \(k\in [K]\) to derive the first inequality; in the last inequality, we have utilized the inequality \(\frac{1+\gamma k}{1+\gamma (k-1)}\le 1+\gamma \), which is due to \(\gamma ^2(k-1)\ge 0\) for all \(\gamma >0\) and \(k\ge 1\). Thus, for any \(\alpha ^{\star }\ge \frac{3+2\gamma ^{-1}}{\mu (1-\lambda _1)}\), the following inequality holds:
Besides, by the definition of \(\delta _k\), for all \(i\in [d]\) and \(k\in [K]\), we get
Because \(\lambda _{1k}=\lambda _1\nu ^{k-1}\), we have
where the first inequality follows from Eq. (33); the last inequality is derived from the following inequality:
where the notation \(\partial _{\nu }\) denotes the derivative operator with respect to \(\nu \). In addition, Eq. (33) implies that \(\delta _{1,i}^{-1}\le \frac{1+\gamma ^{-1}}{\alpha ^{\star }}\) for all \(i\in [d]\). Thus, plugging Eq. (34) into Eq. (10), we have
Furthermore, since \(\alpha _{\max }={\overline{\alpha }}(1)=\alpha ^{\star }(1+\gamma ^{-1})\), Corollary 1 is proved. \(\Box \)
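Remark (a sketch for the reader, not part of the original proof). The elementary inequality \(\frac{1+\gamma k}{1+\gamma (k-1)}\le 1+\gamma \) used above can be checked by clearing the denominator, which is positive for \(\gamma >0\) and \(k\ge 1\):

```latex
\begin{align*}
(1+\gamma)\bigl(1+\gamma(k-1)\bigr)-(1+\gamma k)
  &= 1+\gamma(k-1)+\gamma+\gamma^{2}(k-1)-1-\gamma k \\
  &= \gamma^{2}(k-1) \;\ge\; 0 .
\end{align*}
```

Dividing by \(1+\gamma (k-1)>0\) gives the stated inequality, with equality exactly when \(k=1\).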
