Discrepancy-Based Theory and Algorithms for Forecasting Non-Stationary Time Series

Annals of Mathematics and Artificial Intelligence

Abstract

We present data-dependent learning bounds for the general scenario of non-stationary non-mixing stochastic processes. Our learning guarantees are expressed in terms of a data-dependent measure of sequential complexity and a discrepancy measure that can be estimated from data under some mild assumptions. Our learning bounds guide the design of new algorithms for non-stationary time series forecasting for which we report several favorable experimental results.

References

  1. Adams, T.M., Nobel, A.B.: Uniform convergence of Vapnik-Chervonenkis classes under ergodic sampling. Ann. Probab. 38(4), 1345–1367 (2010)

  2. Agarwal, A., Duchi, J. C.: The generalization ability of online algorithms for dependent data. IEEE Trans. Inf. Theory 59(1), 573–587 (2013)

  3. Alquier, P., Li, X., Wintenberger, O.: Prediction of time series by statistical learning: general losses and fast rates. Depend. Modell. 1, 65–93 (2014)

  4. Alquier, P., Wintenberger, O.: Model selection for weakly dependent time series forecasting. Technical Report 2010-39, Centre de Recherche en Economie et Statistique (2010)

  5. Andrews, D.: First Order Autoregressive Processes and Strong Mixing. Cowles Foundation Discussion Papers 664, Cowles Foundation for Research in Economics, Yale University (1983)

  6. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. arXiv:1701.07875 (2017)

  7. Baillie, R.: Long memory processes and fractional integration in econometrics. J. Econom. 73(1), 5–59 (1996)

  8. Barve, R.D., Long, P.M.: On the complexity of learning from drifting distributions. In: COLT (2010)

  9. Berti, P., Rigo, P.: A Glivenko-Cantelli theorem for exchangeable random variables. Stat. Probab. Lett. 32(4), 385–391 (1997)

  10. Bollerslev, T.: Generalized autoregressive conditional heteroskedasticity. J. Econom. 31(3), 307–327 (1986)

  11. Box, G.E.P., Jenkins, G.: Time Series Analysis, Forecasting and Control. Holden-Day, Incorporated (1990)

  12. Brockwell, P.J., Davis, R.A.: Time Series: Theory and Methods. Springer, New York (1986)

  13. Chen, L., Wu, W.B.: Concentration inequalities for empirical processes of linear time series. J. Mach. Learn. Res. 18(231), 1–46 (2018)

  14. De la Peña, V.H., Giné, E.: Decoupling: from dependence to independence: randomly stopped processes, U-statistics and processes, martingales and beyond. Probability and its applications. Springer, NY (1999)

  15. Doukhan, P.: Mixing: Properties and Examples. Lecture Notes in Statistics. Springer, New York (1994)

  16. Engle, R.: Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica 50(4), 987–1007 (1982)

  17. Hamilton, J.D.: Time Series Analysis. Princeton University Press, Princeton (1994)

  18. Kuznetsov, V., Mariet, Z.: Foundations of sequence-to-sequence modeling for time series. In: AISTATS (2019)

  19. Kuznetsov, V., Mohri, M.: Generalization bounds for time series prediction with non-stationary processes. In: ALT (2014)

  20. Kuznetsov, V., Mohri, M.: Learning theory and algorithms for forecasting non-stationary time series. In: NIPS (2015)

  21. Kuznetsov, V., Mohri, M.: Time series prediction and on-line learning. In: COLT (2016)

  22. Kuznetsov, V., Mohri, M.: Discriminative state space models. In: NIPS (2017a)

  23. Kuznetsov, V., Mohri, M.: Generalization bounds for non-stationary mixing processes. Mach. Learn. 106(1), 93–117 (2017b)

  24. Ledoux, M., Talagrand, M.: Probability in Banach Spaces: Isoperimetry and Processes. Ergebnisse der Mathematik und ihrer Grenzgebiete. Springer, Berlin (1991)

  25. Littlestone, N.: Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Mach. Learn. 2(4), 285–318 (1988)

  26. Lorenz, E. N.: Atmospheric predictability as revealed by naturally occurring analogues. J. Atmos. Sci. 26, 636–646 (1969)

  27. Lozano, A.C., Kulkarni, S.R., Schapire, R.E.: Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. In: NIPS (2006)

  28. Meir, R.: Nonparametric time series prediction through adaptive model selection. Mach. Learn. 39(1), 5–34 (2000)

  29. Modha, D.S., Masry, E.: Memory-universal prediction of stationary random processes. IEEE Trans. Inf. Theory 44(1), 117–133 (1998)

  30. Mohri, M., Medina, A.M.: New analysis and algorithm for learning with drifting distributions. In: ALT (2012)

  31. Mohri, M., Rostamizadeh, A.: Rademacher complexity bounds for non-i.i.d. processes. In: NIPS (2009)

  32. Mohri, M., Rostamizadeh, A.: Stability bounds for stationary φ-mixing and β-mixing processes. J. Mach. Learn. Res. 11, 789–814 (2010)

  33. Pestov, V.: Predictive PAC learnability: A paradigm for learning from exchangeable input data. In: GRC (2010)

  34. Rakhlin, A., Sridharan, K., Tewari, A.: Online learning: Random averages, combinatorial parameters, and learnability. In: NIPS (2010)

  35. Rakhlin, A., Sridharan, K., Tewari, A.: Online learning: Stochastic, constrained, and smoothed adversaries. In: NIPS (2011)

  36. Rakhlin, A., Sridharan, K., Tewari, A.: Online learning via sequential complexities. J. Mach. Learn. Res. 16(1), 155–186 (2015a)

  37. Rakhlin, A., Sridharan, K., Tewari, A.: Sequential complexities and uniform martingale laws of large numbers. Probab. Theory Relat. Fields (2015b)

  38. Shalizi, C., Kontorovich, A.: Predictive PAC learning and process decompositions. In: NIPS (2013)

  39. Steinwart, I., Christmann, A.: Fast learning from non-i.i.d. observations. In: NIPS (2009)

  40. Tao, P.D., An, L.T.H.: A D.C. optimization algorithm for solving the trust-region subproblem. SIAM J. Optim. 8(2), 476–505 (1998)

  41. Vidyasagar, M.: A Theory of Learning and Generalization: With applications to neural networks and control systems. Springer, New York (1997)

  42. Yu, B.: Rates of convergence for empirical processes of stationary mixing sequences. Ann. Probab. 22(1), 94–116 (1994)

  43. Zhao, Z., Giannakis, D.: Analog forecasting with dynamics-adapted kernels. CoRR, arXiv:1412.3831 (2016)

  44. Zimin, A., Lampert, C.: Learning theory for conditional risk minimization. In: AISTATS (2017)

Acknowledgements

This work was partly funded by NSF CCF-1535987, IIS-1618662, and a Google Research Award.

Author information

Corresponding author

Correspondence to Vitaly Kuznetsov.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix

A Proofs

In the construction described in Section 2.1, we denoted by \(\mathbf{z}\) the tree defined using the \(Z_{t}\)s and by \(\mathcal{T}\) the distribution of \(\mathbf{z}\). Here, we will also denote by \(\mathbf{z}^{\prime}\) the tree formed by the \(Z^{\prime}_{t}\)s and by \(\overline{\mathcal{T}}\) the joint distribution of \((\mathbf{z}, \mathbf{z}^{\prime})\).

Lemma 2

Let \({\mathbf{Z}_{1}^{T}}\) be a sequence of random variables and let \(\mathbf{Z}_{1}^{\prime T}\) be a decoupled tangent sequence. Then, for any measurable function G, the following equality holds:

$$ \mathbb{E}\left[G\left( \sup_{f \in \mathcal{F}} \sum\limits_{t = 1}^{T} q_{t} (f(Z^{\prime}_{t}) - f(Z_{t}))\right)\right] = \mathbb{E}_{\boldsymbol{\sigma}} \mathbb{E}_{(\mathbf{z}, \mathbf{z}^{\prime}) \sim \overline{\mathcal{T}}}\left[G\left( \sup_{f} \sum\limits_{t = 1}^{T} \sigma_{t} q_{t} (f(\mathbf{z}^{\prime}_{t}(\boldsymbol{\sigma})) - f(\mathbf{z}_{t}(\boldsymbol{\sigma}))) \right)\right]. $$
(19)

The result also holds with the absolute value around the sums in (19).

Proof

The proof follows an argument invoked in the proof of Theorem 3 of [35]. We only need to check that every step holds for an arbitrary weight vector q, in lieu of the uniform distribution vector u, and for an arbitrary measurable function G, instead of the identity function. Let p denote the joint distribution of the random variables \(Z_{t}\). Observe that we can write the left-hand side of (19) as follows:

$$ \begin{array}{@{}rcl@{}} \mathbb{E}\left[ G\left( \sup_{f \in \mathcal{F}} {\Sigma}(\boldsymbol{\sigma}) \right) \right] = \mathbb{E}_{Z_{1}, Z^{\prime}_{1} \sim \mathbf{p}_{1}} \mathbb{E}_{Z_{2}, Z^{\prime}_{2} \sim \mathbf{p}_{2}(\cdot|Z_{1})} {\cdots} \mathbb{E}_{Z_{T}, Z^{\prime}_{T} \sim \mathbf{p}_{T}(\cdot|\mathbf{Z}_{1}^{T-1})} \left[G\left( \sup_{f \in \mathcal{F}} {\Sigma}(\boldsymbol{\sigma}) \right) \right], \end{array} $$

where \(\boldsymbol{\sigma} = (1,\ldots, 1) \in \{\pm 1\}^{T}\) and \({\Sigma}(\boldsymbol{\sigma}) = {\sum}_{t = 1}^{T} \sigma_{t} q_{t} (f(Z^{\prime}_{t}) - f(Z_{t}))\). Now, by definition of decoupled tangent sequences, the value of the last expression is unchanged if we flip the sign of any \(\sigma_{i} = 1\) to \(-1\), since that is equivalent to permuting \(Z_{i}\) and \(Z^{\prime}_{i}\). Thus, the last expression is in fact equal to

$$ \begin{array}{@{}rcl@{}} \mathbb{E}_{Z_{1}, Z^{\prime}_{1} \sim \mathbf{p}_{1}} \mathbb{E}_{Z_{2}, Z^{\prime}_{2} \sim \mathbf{p}_{2}(\cdot|S_{1}(\sigma_{1}))} {\cdots} \mathbb{E}_{Z_{T}, Z^{\prime}_{T} \sim \mathbf{p}_{T}(\cdot|S_{1}(\sigma_{1}), \ldots, S_{T-1}(\sigma_{T-1}))} \left[G\left( \sup_{f \in \mathcal{F}} {\Sigma}(\boldsymbol{\sigma}) \right) \right] \end{array} $$

for any sequence \(\boldsymbol{\sigma} \in \{\pm 1\}^{T}\), where \(S_{t}(1) = Z_{t}\) and \(S_{t}(-1) = Z^{\prime}_{t}\). Since this equality holds for any \(\boldsymbol{\sigma}\), it also holds for the mean with respect to a uniformly distributed \(\boldsymbol{\sigma}\). Therefore, the last expression is equal to

$$ \begin{array}{@{}rcl@{}} \mathbb{E}_{\boldsymbol{\sigma}} \mathbb{E}_{Z_{1}, Z^{\prime}_{1} \sim \mathbf{p}_{1}} \mathbb{E}_{Z_{2}, Z^{\prime}_{2} \sim \mathbf{p}_{2}(\cdot|S_{1}(\sigma_{1}))} {\cdots} \mathbb{E}_{Z_{T}, Z^{\prime}_{T} \sim \mathbf{p}_{T}(\cdot|S_{1}(\sigma_{1}), \ldots, S_{T-1}(\sigma_{T-1}))} \left[G\left( \sup_{f \in \mathcal{F}} {\Sigma}(\boldsymbol{\sigma}) \right) \right]. \end{array} $$

This last expectation coincides with the expectation with respect to drawing a random tree z and its tangent tree \(\mathbf {z}^{\prime }\) according to \(\overline {\mathcal {T}}\) and a random path σ to follow in that tree. That is, the last expectation is equal to

$$ \begin{array}{@{}rcl@{}} \mathbb{E}_{\boldsymbol{\sigma}} \mathbb{E}_{(\mathbf{z}, \mathbf{z}^{\prime}) \sim \overline{\mathcal{T}}}\left[G\left( \sup_{f} \sum\limits_{t = 1}^{T} \sigma_{t} q_{t} (f(\mathbf{z}^{\prime}_{t}(\boldsymbol{\sigma})) - f(\mathbf{z}_{t}(\boldsymbol{\sigma}))) \right)\right], \end{array} $$

which concludes the proof. □

Theorem 6

Let \(p \geq 1\) and \(\mathcal{F} = \{(\mathbf{x}, y) \to (\mathbf{w} \cdot {\Psi}(\mathbf{x}) - y)^{p} \colon \|\mathbf{w}\|_{\mathbb{H}} \leq {\Lambda}\}\), where \(\mathbb{H}\) is a Hilbert space and \({\Psi}\colon \mathcal{X} \to \mathbb{H}\) a feature map. Assume that the condition \(|\mathbf{w} \cdot {\Psi}(\mathbf{x}) - y| \leq M\) holds for all \((\mathbf{x}, y)\in \mathcal{Z}\) and all \(\mathbf{w}\) such that \(\|\mathbf{w}\|_{\mathbb{H}} \leq {\Lambda}\). Fix \(\mathbf{q}^{*}\). Then, if \({\mathbf{Z}_{1}^{T}}=({\mathbf{X}_{1}^{T}},{\mathbf{Y}_{1}^{T}})\) is a sequence of random variables, for any \(\delta > 0\), with probability at least \(1 - \delta\), the following bound holds for all \(h \in \mathcal{H} = \{\mathbf{x} \to \mathbf{w} \cdot {\Psi}(\mathbf{x}) \colon \|\mathbf{w}\|_{\mathbb{H}} \leq {\Lambda}\}\) and all \(\mathbf{q}\) such that \(0 < \|\mathbf{q} - \mathbf{q}^{*}\|_{1} \leq 1\):

$$ \begin{array}{@{}rcl@{}} \mathbb{E}[(h(X_{T+1}) - Y_{T+1})^{p}|{\mathbf{Z}_{1}^{T}}] \leq \sum\limits_{t = 1}^{T} q_{t} (h(X_{t}) - Y_{t})^{p} + \text{disc}(\mathbf{q}) + G(\mathbf{q}) + 4M \|\mathbf{q} - \mathbf{q}^{*}\|_{1} \end{array} $$

where \(G(\mathbf{q}) = 4M \left(\sqrt{8 \log \frac{2}{\delta}} + \sqrt{2 \log \log_{2} 2 (1 - \|\mathbf{q} - \mathbf{q}^{*}\|_{1})^{-1}} + \widetilde{C}_{T} {\Lambda} r \right) \left(\|\mathbf{q}^{*}\|_{2} + 2 \|\mathbf{q} - \mathbf{q}^{*}\|_{1}\right)\) and \(\widetilde{C}_{T} = 48 pM^{p} \sqrt{4 \pi \log T} \left(1 + 4\sqrt{2} \log^{3/2} (eT^{2})\right)\). Thus, for \(p = 2\),

$$ \begin{array}{@{}rcl@{}} \mathbb{E}[(h(X_{T+1}) &-& Y_{T+1})^{2}|{\mathbf{Z}_{1}^{T}}] \leq {\sum}_{t = 1}^{T} q_{t} (h(X_{t}) - Y_{t})^{2} + \text{disc}(\mathbf{q}) \\ & +& O\left( {\Lambda} r (\log^{2} T) \sqrt{\log \log_{2} 2(1- \|\mathbf{q} -\mathbf{q}^{*}\|_{1})^{-1}} \left( \|\mathbf{q}^{*}\|_{2} + \|\mathbf{q} - \mathbf{q}^{*}\|_{1} \right)\right). \end{array} $$

This result extends Theorem 4 to hold uniformly over q. Similarly, one can prove an analogous extension of Theorem 1. The result suggests that we should seek to minimize \({\sum}_{t = 1}^{T} q_{t} f(Z_{t}) + \text{disc}(\mathbf{q})\) over q and w. This bound is, in a certain sense, analogous to margin bounds: it is most favorable when there exists a good choice of weight vector \(\mathbf{q}^{*}\), and we hope to find a q close to this weight vector. These insights are used to develop our algorithmic solutions for forecasting non-stationary time series in Section 6.
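
As a purely illustrative sketch of this principle (not the algorithm of Section 6), the following Python code alternates between solving a weighted kernel ridge regression for fixed weights and taking a projected gradient step on the weights against the weighted empirical loss plus a per-sample discrepancy term. The kernel matrix K, the per-sample discrepancy estimates d, the regularization parameters lam_w and lam_q, and the smooth penalty standing in for disc(q) are all assumptions made here for the example.

import numpy as np

def fit_weighted_krr(K, y, q, lam_w):
    # Dual coefficients c of weighted kernel ridge regression:
    #   min_w sum_t q_t (w . Psi(x_t) - y_t)^2 + lam_w ||w||^2,
    # with predictor h(x) = sum_t c_t K(x_t, x) and c = (K + lam_w diag(1/q))^{-1} y.
    return np.linalg.solve(K + lam_w * np.diag(1.0 / np.maximum(q, 1e-12)), y)

def project_to_simplex(v):
    # Euclidean projection of v onto the probability simplex.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1.0) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def alternating_minimization(K, y, d, lam_w=1.0, lam_q=1.0, n_iters=20, step=0.1):
    # Schematic alternating scheme for the illustrative objective
    #   sum_t q_t * loss_t(w) + d . q + lam_q * ||q - u||_2^2,
    # where d is a vector of per-sample discrepancy estimates (an assumption here).
    T = len(y)
    q = np.full(T, 1.0 / T)                      # start from uniform weights
    for _ in range(n_iters):
        c = fit_weighted_krr(K, y, q, lam_w)     # fit the predictor for fixed q
        losses = (K @ c - y) ** 2                # per-sample squared losses
        grad = losses + d + 2.0 * lam_q * (q - 1.0 / T)
        q = project_to_simplex(q - step * grad)  # gradient step on q, back to the simplex
    return fit_weighted_krr(K, y, q, lam_w), q

# Toy usage (all quantities synthetic):
# rng = np.random.default_rng(0)
# X = rng.normal(size=(100, 2)); y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
# K = X @ X.T; d = np.linspace(1.0, 0.0, 100)   # e.g., older points deemed more discrepant
# c, q = alternating_minimization(K, y, d)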

Proof

Let \((\epsilon _{k})_{k=0}^{\infty }\) and \((\mathbf {q}(k))_{k=0}^{\infty }\) be infinite sequences specified below. By Theorem 4, the following holds for each k

$$ \begin{array}{@{}rcl@{}} \mathbb{P} \left( \mathbb{E}[f(Z_{T + 1})|{\mathbf{Z}_{1}^{T}}] > \sum\limits_{t = 1}^{T} q_{t}(k) f(Z_{t}) + {\Delta}(\mathbf{q}(k)) + C(\mathbf{q}(k)) + 4 M \|\mathbf{q}(k)\|_{2} \epsilon_{k} \right) \leq \exp(-{\epsilon_{k}^{2}} ), \end{array} $$

where Δ(q(k)) denotes the discrepancy computed with respect to the weights q(k) and \(C(\mathbf {q}(k)) = \widetilde {C}_{T} \|\mathbf {q}(k)\|_{2}\). Let \(\epsilon _{k} = \epsilon + \sqrt {2 \log k}\). Then, by the union bound we can write

$$ \begin{array}{@{}rcl@{}} \mathbb{P}\left( \exists k \colon \! \mathbb{E}[f(Z_{T + 1}) |{\mathbf{Z}_{1}^{T}}]\! >\! \sum\limits_{t = 1}^{T} q_{t}(k) f(Z_{t})\! +\! {\Delta}(\mathbf{q}(k))\! +\! C(\mathbf{q}(k))\! + \! 4M \|\mathbf{q}(k)\|_{2} \epsilon_{k} \right) & \!\leq\!& \sum\limits_{k=1}^{\infty} e^{-{\epsilon_{k}^{2}}} \\ & \!\leq\!& \sum\limits_{k=1}^{\infty} e^{-\epsilon^{2} - \log k^{2}} \\ & \!\leq\!& 2 e^{-\epsilon^{2}}. \end{array} $$

We choose the sequence q(k) to satisfy \(\|\mathbf{q}(k) - \mathbf{q}^{*}\|_{1} = 1 - 2^{-k}\). Then, for any q such that \(0 < \|\mathbf{q} - \mathbf{q}^{*}\|_{1} \leq 1\), there exists k ≥ 1 such that

$$ \begin{array}{@{}rcl@{}} 1 - \|\mathbf{q}(k) - \mathbf{q}^{*}\|_{1} < 1 - \|\mathbf{q} - \mathbf{q}^{*}\|_{1} \leq 1- \|\mathbf{q}(k-1) - \mathbf{q}^{*}\|_{1} \leq 2(1 - \| \mathbf{q}(k) - \mathbf{q}^{*}\|_{1}). \end{array} $$

Since \(1 - \|\mathbf{q} - \mathbf{q}^{*}\|_{1} \leq 1 - \|\mathbf{q}(k-1) - \mathbf{q}^{*}\|_{1} = 2^{-(k-1)}\), we have \(k \leq \log_{2} 2(1 - \|\mathbf{q} - \mathbf{q}^{*}\|_{1})^{-1}\), and thus the following inequality holds:

$$ \begin{array}{@{}rcl@{}} \sqrt{2\log k} \leq \sqrt{2\log \log_{2} 2(1- \|\mathbf{q} - \mathbf{q}^{*}\|_{1})^{-1}}. \end{array} $$

Combining this with the observation that the following two inequalities hold:

$$ \begin{array}{@{}rcl@{}} \sum\limits_{t = 1}^{T} q_{t}(k-1) f(Z_{t}) & \leq& \sum\limits_{t = 1}^{T} q_{t} f(Z_{t}) + 2 M \|\mathbf{q} - \mathbf{q}^{*}\|_{1}\\ {\Delta}(\mathbf{q}(k-1)) & \leq& {\Delta}(\mathbf{q}) + 2 M \|\mathbf{q} - \mathbf{q}^{*}\|_{1}, \\ \|\mathbf{q}(k-1)\|_{2} & \leq& 2 \| \mathbf{q} - \mathbf{q}^{*}\|_{1} + \|\mathbf{q}^{*}\|_{2} \end{array} $$

shows that the event

$$ \begin{array}{@{}rcl@{}} \left\{ \mathbb{E}[f(Z_{T + 1})|{\mathbf{Z}_{1}^{T}}] > \sum\limits_{t = 1}^{T} q_{t} f(Z_{t}) + \text{disc}(\mathbf{q}) + G(\mathbf{q}) + 4M \|\mathbf{q} - \mathbf{q}^{*}\|_{1} \right\} \end{array} $$

where \(G(\mathbf {q}) = 4M \left (\epsilon + \sqrt {2 \log \log _{2} 2 (1 - \|\mathbf {q} - \mathbf {q}^{*}\|_{1})^{-1}} + \widetilde {C}_{T} {\Lambda } r \right ) \left (\|\mathbf {q}^{*}\|_{2} + 2 \|\mathbf {q} - \mathbf {q}^{*}\|_{1}\right )\) implies the following one

$$ \begin{array}{@{}rcl@{}} \left\{ \mathbb{E}[f(Z_{T + 1}) |{\mathbf{Z}_{1}^{T}}]\! >\! \sum\limits_{t = 1}^{T} q_{t}(k-1) f(Z_{t})\! +\! {\Delta}(\mathbf{q}(k-1))\! +\! C(\mathbf{q}(k-1))\! + \! 4M \|\mathbf{q}(k-1)\|_{2} \epsilon_{k-1} \right\}, \end{array} $$

which completes the proof. □

B Dual Optimization Problem

In this section, we provide a detailed derivation of the optimization problem in (15), starting from the optimization problem in (11). The first step is to appeal to the following chain of equalities:

$$ \begin{array}{@{}rcl@{}} \min_{\mathbf{w}} {} &&\left\{ \sum\limits_{t=1}^{T} q_{t} (\mathbf{w} \cdot {\Psi}(x_{t}) - y_{t})^{2} + \lambda_{2} \|\mathbf{w}\|_{\mathbb{H}}^{2} \right\} \\ &&= \min_{\mathbf{w}} \left\{ \sum\limits_{t=1}^{T} (\mathbf{w} \cdot x^{\prime}_{t} - y^{\prime}_{t})^{2} + \lambda_{2} \|\mathbf{w}\|_{\mathbb{H}}^{2} \right\} \\ &&= \max_{\boldsymbol{\upbeta}} \left\{ -\lambda_{2} \sum\limits_{t=1}^{T} {{\upbeta}_{t}^{2}} - \sum\limits_{s,t=1}^{T} {\upbeta}_{s} {\upbeta}_{t}\, x^{\prime}_{s} \cdot x^{\prime}_{t} + 2 \lambda_{2} \sum\limits_{t=1}^{T} {\upbeta}_{t} y^{\prime}_{t} \right\} \\ &&= \max_{\boldsymbol{\upbeta}} \left\{ -\lambda_{2} \sum\limits_{t=1}^{T} {{\upbeta}_{t}^{2}} - \sum\limits_{s,t=1}^{T} {\upbeta}_{s} {\upbeta}_{t} \sqrt{q_{s}}\sqrt{q_{t}} K_{s,t} + 2 \lambda_{2} \sum\limits_{t=1}^{T} {\upbeta}_{t} \sqrt{q_{t}} y_{t} \right\} \\ &&= \max_{\boldsymbol{\alpha}} \left\{ -\lambda_{2} \sum\limits_{t=1}^{T} \frac{{\alpha_{t}^{2}}}{q_{t}} - \boldsymbol{\alpha}^{T} \mathbf{K} \boldsymbol{\alpha} + 2 \lambda_{2} \boldsymbol{\alpha}^{T} \mathbf{Y} \right\}, \end{array} $$
(20)

where the first equality follows by substituting \(x^{\prime}_{t} = \sqrt{q_{t}} {\Psi}(x_{t})\) and \(y^{\prime}_{t} = \sqrt{q_{t}} y_{t}\), the second equality uses the dual formulation of the kernel ridge regression problem, and the last equality follows from the change of variables \(\alpha_{t} = \sqrt{q_{t}} {\upbeta}_{t}\).
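
As a quick sanity check of this change of variables (purely illustrative, using randomly generated toy data and a linear kernel; none of the variable names below come from the paper), the following snippet verifies numerically that the objective written in terms of β and the scaled points \(x^{\prime}_{t}, y^{\prime}_{t}\) coincides with the same objective written in terms of α and the kernel matrix K:

import numpy as np

rng = np.random.default_rng(0)
T, lam2 = 8, 0.5
X = rng.normal(size=(T, 3))                 # toy features; Psi is the identity map
y = rng.normal(size=T)
q = rng.dirichlet(np.ones(T))               # toy weight vector with q_t > 0
K = X @ X.T                                 # kernel matrix K_{s,t} = Psi(x_s) . Psi(x_t)
beta = rng.normal(size=T)                   # arbitrary dual vector

Xp = np.sqrt(q)[:, None] * X                # x'_t = sqrt(q_t) Psi(x_t)
yp = np.sqrt(q) * y                         # y'_t = sqrt(q_t) y_t
obj_beta = (-lam2 * np.sum(beta ** 2)
            - beta @ (Xp @ Xp.T) @ beta
            + 2 * lam2 * beta @ yp)

alpha = np.sqrt(q) * beta                   # change of variables alpha_t = sqrt(q_t) beta_t
obj_alpha = (-lam2 * np.sum(alpha ** 2 / q)
             - alpha @ K @ alpha
             + 2 * lam2 * alpha @ y)

assert np.isclose(obj_beta, obj_alpha)      # the two forms of the objective agree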

By (20), the optimization problem in (11) is equivalent to the following optimization problem:

$$ \begin{array}{@{}rcl@{}} \min_{0 \leq \mathbf{q} \leq 1}\left\{ \max_{\boldsymbol{\alpha}} \left\{ -\lambda_{1} \sum\limits_{t=1}^{T} \frac{{\alpha_{t}^{2}}}{q_{t}} - \boldsymbol{\alpha}^{T} \mathbf{K} \boldsymbol{\alpha} + 2 \lambda_{1} \boldsymbol{\alpha}^{T} \mathbf{Y} \right\} + (\mathbf{d} \!\cdot\! \mathbf{q}) + \lambda_{2} \|\mathbf{q} - \mathbf{u}\|_{p} \right\}. \end{array} $$

Next, we apply the change of variables \(r_{t} = 1/q_{t}\) and appeal to the same arguments as were given for the primal problem in Section 6 to arrive at (15).
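
For illustration only (the exact form of (15) is not reproduced here), the effect of this substitution is that the quadratic-over-linear term of the inner maximization becomes linear in r, while the linear term \(\mathbf{d} \cdot \mathbf{q}\) becomes a sum of reciprocals:

$$ -\lambda_{1} \sum\limits_{t=1}^{T} \frac{{\alpha_{t}^{2}}}{q_{t}} = -\lambda_{1} \sum\limits_{t=1}^{T} r_{t} {\alpha_{t}^{2}}, \qquad \mathbf{d} \cdot \mathbf{q} = \sum\limits_{t=1}^{T} \frac{d_{t}}{r_{t}}, $$

and each coordinate constraint \(0 < q_{t} \leq 1\) becomes \(r_{t} \geq 1\).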

Cite this article

Kuznetsov, V., Mohri, M. Discrepancy-Based Theory and Algorithms for Forecasting Non-Stationary Time Series. Ann Math Artif Intell 88, 367–399 (2020). https://doi.org/10.1007/s10472-019-09683-1
