Improving stock market volatility forecasts with complete subset linear and quantile HAR models

https://doi.org/10.1016/j.eswa.2021.115416

Highlights

  • We design complete subset linear (CSLR) and quantile regression (CSQR) HAR models.

  • Our approach is on the border of machine learning and standard econometric literature.

  • Our sample covers four broad market indices: S&P 500, NIKKEI 225, STOXX 50, SSEC.

  • CSLR and CSQR tend to outperform benchmark models: HAR-RV, HAR-SJ, HAR-SV, HAR-CJ.

Abstract

Volatility forecasting plays an integral role in risk management, investments and security valuation for all assets with uncertain future payoffs. We enrich the literature by presenting computationally intensive variations of the heterogeneous autoregressive (HAR) volatility model: the complete subset linear/quantile regression HAR models, HAR-CSLR and HAR-CSQR. Predictions of 1- to 22-day-ahead volatility of four major market indices (NIKKEI 225, S&P 500, SSEC and STOXX 50) show that both models tend to outperform several benchmark HAR models. Forecasting accuracy improvements tend to stabilize for longer forecasting horizons: e.g., five-day-ahead improvements range from 6.57% (SSEC) to 35.62% (NIKKEI 225) and from 3.99% (STOXX) to 9.54% for the mean square error (MSE) and QLIKE loss functions, respectively. In terms of MSE, the HAR-CSQR model outperforms several standard benchmark HAR models across all market indices and forecast horizons.

Introduction

Fluctuations of asset prices are essential for pricing traded assets. It follows that volatility forecasting models play a key role in risk management, investments and security valuation for all assets with uncertain future payoffs. The volatility of asset returns tends to be highly persistent. From early on, this has led researchers to use autoregressive volatility models (e.g., the autoregressive fractionally integrated moving average (ARFIMA) model of Granger & Joyeux, 1980) and the popular generalized autoregressive conditional heteroskedasticity (GARCH) class of latent volatility models proposed by Bollerslev (1986). Many variations of these models have been proposed. However, as high-frequency data have become more accessible, interest has shifted to volatility models that use directly observable and measurable volatility, that is, realized volatility, defined as the sum of squared intraday returns (e.g., Andersen et al., 2001a; Andersen et al., 2001b; Barndorff-Nielsen and Shephard, 2002). Among these models, the heterogeneous autoregressive (HAR) model of Corsi (2009) has become the new standard "to beat". Compared to GARCH models, the HAR model is simple to estimate, as the realized volatility is explained by past daily, weekly and monthly historical volatility components within a linear regression framework. Still, the model accurately captures the long-memory property of volatility. HAR models are also easy to interpret and to extend with new explanatory variables. For example, Patton and Sheppard (2015) included realized semivariances, Andersen et al. (2012) disentangled realized volatility into its jump and continuous components, Bollerslev et al. (2016) exploited the measurement error of the volatility, and Corsi and Reno (2009) and Horpestad et al. (2019) included asymmetric returns.

Another strand of the literature reports forecasting improvements from using machine learning techniques, such as gradient descent boosting, random forest, support vector (quantile) machine, artificial neural network, and deep learning, to predict market volatility (e.g., Baruník and Křehlík, 2016; Liu, 2019; Ramos-Pérez et al., 2019; Xu et al., 2019). Our approach is on the border of the two strands of the literature, as we use a data-driven approach that is easily tractable and interpretable if necessary. Specifically, the complete subset approach is a combination of feature engineering and ensemble methods used in machine learning, while the HAR model is a standard econometric approach to predict market volatility.

In this paper, we propose a volatility forecasting model that makes use of realized volatility quantile forecasts to determine the expected volatility. Specifically, we first predict several quantiles of the volatility density; then, we aggregate quantiles of volatility into the expected (point estimate) volatility forecast. Although in our research, volatility density is not the goal but rather a tool towards achieving point forecasts of volatility, our research is related to the scant literature on volatility density forecasting. Berkowitz (2001) argued that volatility density forecasts are important requirements for stress testing of banks, calculating margin requirements and pricing financial derivatives. Volatility density forecasts are usually based on parametric models (e.g., Corsi et al., 2008). Our nonparametric approach to density forecasts is motivated by the work of Gaglianone and Lima (2012), who used quantile regression to predict the distribution of U.S. unemployment and survey forecasts. Quantile regression is appealing, as it does not require the assumption of a parametric form of the conditional distribution of the variable of interest. Unsurprisingly, several others have followed this line of thought. For example, Manzan and Zerom (2013) predicted the distribution of U.S. inflation, and Pedersen (2015) predicted the distribution of equity and bond market returns. Moreover, Meligkotsidou et al. (2019a) have used nonparametric density forecasts to predict the equity market premium.

As tail events are, by definition, rare, predicting quantiles, specifically, the tails of a target variable’s distribution, might lead to large forecast errors, which may deteriorate the accuracy of the predictions created from quantile forecasts. To reduce forecast errors, Meligkotsidou et al. (2019a) adapted the idea of complete subset linear regression (CSLR) of Elliott et al. (2013). CSLR forecasts the target variable by aggregating forecasts using all model specifications that are possible given a set of $K$ explanatory variables and a number of admissible independent variables $k \le K$. For example, given a linear regression framework and $K=4$ potential explanatory variables, one could create 4 forecasts using models with one independent variable ($k=1$), 6 forecasts with $k=2$, 4 with $k=3$ and 1 with $k=4$, i.e., $K!/((K-k)!\,k!)$ forecasts for each $k$. These forecasts can then be combined via a suitable function to obtain the point forecast of interest. Meligkotsidou et al. (2019a) adapted this approach for quantile regression, leading to complete subset quantile regression (CSQR), which we also exploit in this study. Finally, Meligkotsidou et al. (2019b) predicted the monthly level of the U.S. S&P 500 market index volatility using the CSQR approach by expanding a first-order autoregressive model with macroeconomic variables.
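The complete subset idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the forecasts are combined with an equal-weight average (the text only says "a suitable function"), the data are synthetic, and the function name complete_subset_forecast is hypothetical rather than taken from the paper.

```python
from itertools import combinations
from math import comb

import numpy as np


def complete_subset_forecast(X, y, x_new, k):
    """Average the OLS forecasts of all models using exactly k of the K columns of X.

    Equal-weight averaging is an assumption made for this sketch.
    """
    n, K = X.shape
    forecasts = []
    for subset in combinations(range(K), k):
        cols = list(subset)
        # Design matrix: intercept plus the selected predictors.
        A = np.column_stack([np.ones(n), X[:, cols]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        forecasts.append(beta[0] + x_new[cols] @ beta[1:])
    return float(np.mean(forecasts)), len(forecasts)


# Number of candidate models for K = 4 and k = 1..4: 4, 6, 4 and 1.
counts = [comb(4, k) for k in range(1, 5)]

# Synthetic data, purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 0.5 + X @ np.array([0.4, 0.3, 0.2, 0.1]) + rng.normal(scale=0.1, size=200)

point, n_models = complete_subset_forecast(X, y, X[-1], k=2)
```

With k equal to K the procedure collapses to a single full regression, whereas smaller k averages many small models, which is where the reduction in forecast error is expected to come from.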

We contribute to the volatility literature and extend the previous studies by combining standard HAR volatility models and the CSLR (linear) and CSQR (quantile) approaches into the HAR-CSLR and HAR-CSQR volatility models. The accuracy of HAR-CSLR and HAR-CSQR is empirically tested on a sample of market indices of four large markets, the U.S. S&P 500, Japan’s NIKKEI 225, China’s SSEC Composite and Europe’s STOXX 50. We find that HAR-CSLR and HAR-CSQR models tend to outperform popular benchmark HAR models. Meligkotsidou et al. (2019b) is most closely related to this research; however, we differ in several aspects.

First, as a baseline model, we rely on several popular HAR models instead of the simple autoregressive (AR) model specification (as in Meligkotsidou et al., 2019b). We therefore contribute to the extensive literature on various HAR model types (e.g., Degiannakis et al., 2020, Patton and Sheppard, 2015). Among four standard HAR models, we cannot find (given our sample) one that performs best for all market indices, loss functions and forecast horizons, yet the approach of Patton and Sheppard (2015) tends to consistently perform well.

Second, Meligkotsidou et al. (2019b) studied monthly levels of volatility. However, most of the existing volatility studies are concerned with day-ahead volatility forecasts, as shorter forecast horizons are more relevant for managing positions of risky assets. We therefore study the accuracy of the HAR-CSLR and HAR-CSQR models for 1-, 2-, …, 22-day-ahead forecasts. This way we provide evidence on the usefulness of HAR-CSLR and HAR-CSQR forecasts for daily forecast periods as well as for periods leading to a monthly level of volatility that corresponds to approximately 22 trading days. We are therefore able to evaluate whether HAR-CSLR and HAR-CSQR have greater merit for shorter or longer forecasting horizons. We find that when HAR-CSLR and HAR-CSQR models are evaluated via the mean square error loss function, they tend to outperform benchmark HAR models for longer forecast horizons. On the other hand, the asymmetric loss function tends to suggest that HAR-CSLR and HAR-CSQR work best for forecast horizons of up to nine days.
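To make the contrast between the two loss functions concrete, the sketch below computes the MSE and a QLIKE loss. The QLIKE form used here, RV/F - ln(RV/F) - 1, is a common normalization and is an assumption of this sketch, since the text does not state which variant is used; the numbers are made up.

```python
import numpy as np


def mse(rv, forecast):
    """Mean square error between realized and forecast variance."""
    rv, forecast = np.asarray(rv, float), np.asarray(forecast, float)
    return float(np.mean((rv - forecast) ** 2))


def qlike(rv, forecast):
    """QLIKE loss in the assumed form RV/F - ln(RV/F) - 1 (zero for a perfect forecast)."""
    ratio = np.asarray(rv, float) / np.asarray(forecast, float)
    return float(np.mean(ratio - np.log(ratio) - 1.0))


rv = np.array([1.0])
under = np.array([0.5])  # forecast too low by 0.5
over = np.array([1.5])   # forecast too high by 0.5

# MSE treats both errors identically; QLIKE penalizes the under-prediction more.
mse_under, mse_over = mse(rv, under), mse(rv, over)
qlike_under, qlike_over = qlike(rv, under), qlike(rv, over)
```

This asymmetry is one reason why rankings of models can differ between the two loss functions.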

Third, Meligkotsidou et al. (2019a, 2019b) relied on macroeconomic data. While macroeconomic variables might be useful even when modeling short-term volatility (e.g., Lyócsa et al., 2020b), in such settings, it is difficult to evaluate the role CSLR and CSQR play in volatility forecasting, as part of the increased accuracy might be due to the use of macroeconomic variables and not the modeling approach per se. Our approach is different in that we use only data that can be retrieved from the price series, and we also model realized variance directly.

Fourth, previous studies have employed three- or five-quantile aggregation, which has left an open question of whether aggregating across more quantiles is beneficial in practice. This factor has important practical implications, as the need to predict multiple quantiles in a CSQR framework increases the computation time. We therefore present a seven-quantile method and empirically compare three aggregation techniques (three-, five- and seven-quantile methods). As our results suggest that the five- and seven-quantile methods perform similarly, we recommend the use of the more parsimonious five-quantile method.
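As an illustration of aggregating quantile forecasts into a point forecast, the sketch below applies two fixed-weight robust location estimators: Tukey's trimean (three quantiles) and a five-quantile weighting. These particular weights are an assumption of this sketch and are not stated in the text; the seven-quantile weights are not given here, so they are left out. The quantile forecasts are hypothetical numbers.

```python
# Fixed-weight schemes (assumed for illustration, not taken from the source).
TRIMEAN = {0.25: 0.25, 0.50: 0.50, 0.75: 0.25}
FIVE_QUANTILE = {0.10: 0.05, 0.25: 0.25, 0.50: 0.40, 0.75: 0.25, 0.90: 0.05}


def aggregate_quantiles(quantile_forecasts, weights):
    """Weighted sum of quantile forecasts; the weights must sum to one."""
    if abs(sum(weights.values()) - 1.0) > 1e-12:
        raise ValueError("weights must sum to one")
    return sum(w * quantile_forecasts[tau] for tau, w in weights.items())


# Hypothetical quantile forecasts of next-day realized variance (right-skewed).
q = {0.10: 0.6, 0.25: 0.8, 0.50: 1.0, 0.75: 1.4, 0.90: 2.0}

three = aggregate_quantiles(q, TRIMEAN)
five = aggregate_quantiles(q, FIVE_QUANTILE)
```

For a right-skewed set of quantiles such as this one, both point forecasts sit above the median, and the five-quantile version is pulled slightly further by the upper tail.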

The remainder of this paper is organized as follows. In Section 2, we describe our sample and realized measures. In Section 3, we outline the benchmark models and the HAR-CSLR and HAR-CSQR models, along with the forecasting procedure and forecast evaluation framework. Section 4 reports our results, and Section 5 concludes and highlights further lines of research.

Data sources

To demonstrate the HAR-CSLR and HAR-CSQR models, we use data on four market indices corresponding to the largest markets, namely, the S&P 500 (U.S.), STOXX 50 (Europe), NIKKEI 225 (Japan), and SSEC Composite (China). Our sample starts in January 2003 and ends in March 2020. The four indices track the development of the largest stock markets in the world, which, given the most recent data, correspond to approximately two thirds of the total market capitalization of the world.

Standard predictive HAR models

The standard HAR-RV model of Corsi (2009) predicts volatility using a set of three volatility components: the average level of volatility over the past one ($RV_t^D$, daily), five ($RV_t^W$, weekly) or twenty-two ($RV_t^M$, monthly) trading days:

$$RV_{t,H} = \beta_0 + \beta_1 RV_t^D + \beta_2 RV_t^W + \beta_3 RV_t^M + u_{t,H}$$

We next use the HAR-CJ model specification (similar to that of Andersen et al., 2007; Degiannakis et al., 2020; Sévi, 2014), which considers the continuous and jump ($JC_t$) volatility components. Specifically, $JC_t$, which is the
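The HAR-RV regression above can be sketched as follows. This is a minimal illustration, assuming a one-day-ahead target, plain OLS estimation, and a synthetic series; the helper name har_design is hypothetical and the paper's own estimation details may differ.

```python
import numpy as np


def har_design(rv):
    """Build HAR-RV regressors (daily, 5-day and 22-day average RV) and a one-day-ahead target."""
    rv = np.asarray(rv, dtype=float)
    # Each row needs 22 past observations and one future observation.
    t_idx = np.arange(21, len(rv) - 1)
    daily = rv[t_idx]
    weekly = np.array([rv[t - 4 : t + 1].mean() for t in t_idx])
    monthly = np.array([rv[t - 21 : t + 1].mean() for t in t_idx])
    X = np.column_stack([np.ones(len(t_idx)), daily, weekly, monthly])
    y = rv[t_idx + 1]
    return X, y


# Synthetic positive series standing in for realized variance (illustrative only).
rng = np.random.default_rng(1)
rv = np.abs(rng.normal(size=300)) + 0.1

X, y = har_design(rv)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # beta_0, ..., beta_3

# Forecast for the day after the sample, using the latest components.
x_last = np.array([1.0, rv[-1], rv[-5:].mean(), rv[-22:].mean()])
forecast = float(x_last @ beta)
```

Because the three regressors are simple moving averages of the same series, the model remains a linear regression, which is what makes it straightforward to embed in a complete subset framework.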

Baseline results

The summary of the realized measures reported in Table 1 and the visualization of the series in Figs. 2 and 3 show well-known stylized facts of the volatility series. Realized volatility is skewed to the right and highly persistent. Even at the 22nd lag, the autocorrelation coefficient is 0.28 for the S&P 500 and 0.09 for the NIKKEI 225. Moreover, the continuous component ($MV_t$) shows greater persistence, and the signed jumps ($SJ_t$) show almost no persistence. Also notable is the high

Conclusion

We extend the heterogeneous autoregressive (HAR) model of Corsi (2009) and its recent extensions (e.g., Andersen et al., 2012; Patton and Sheppard, 2015) via the complete subset regression of Elliott et al. (2013) (HAR-CSLR model) and the complete subset quantile regression of Meligkotsidou et al. (2019a, 2019b) (HAR-CSQR model). The HAR-CSLR and HAR-CSQR models are empirically tested to predict the 1- to 22-day-ahead realized variance of four major market indices, the

CRediT authorship contribution statement

Štefan Lyócsa: Conceptualization, Methodology, Software, Writing - original draft. Daniel Stašek: Data curation, Writing - original draft, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (54)

  • Lyócsa, Štefan et al. (2020). Fear of the coronavirus and the stock markets. Finance Research Letters.
  • Lyócsa, Štefan et al. (2020). Impact of macroeconomic news, regulation and hacking exchange markets on the volatility of bitcoin. Journal of Economic Dynamics and Control.
  • Lyócsa, Štefan et al. (2017). Volatility forecasting of non-ferrous metal futures: Covariances, covariates or combinations? Journal of International Financial Markets, Institutions and Money.
  • Lyócsa, Štefan et al. (2021). Predicting risk in energy markets: Low-frequency data still matter. Applied Energy.
  • Ma, Feng et al. (2018). Are low-frequency data really uninformative? A forecasting combination perspective. The North American Journal of Economics and Finance.
  • Manzan, Sebastiano et al. (2013). Are macroeconomic variables useful for forecasting the distribution of US inflation? International Journal of Forecasting.
  • Molnár, Peter (2012). Properties of range-based volatility estimators. International Review of Financial Analysis.
  • Patton, Andrew J. (2011). Volatility forecast comparison using imperfect volatility proxies. Journal of Econometrics.
  • Patton, Andrew J. et al. (2009). Optimal combinations of realised volatility estimators. International Journal of Forecasting.
  • Ramos-Pérez, Eduardo et al. (2019). Forecasting volatility with a stacked model based on a hybridized artificial neural network. Expert Systems with Applications.
  • Sévi, Benoît (2014). Forecasting the volatility of crude oil futures using intraday data. European Journal of Operational Research.
  • Taylor, Nick (2017). Realised variance forecasting under Box–Cox transformations. International Journal of Forecasting.
  • Xu, Qifa et al. (2019). A novel UMIDAS–SVQR model with mixed frequency investor sentiment for predicting stock market volatility. Expert Systems with Applications.
  • Andersen, Torben G. et al. (2007). Roughing it up: Including jump components in the measurement, modeling, and forecasting of return volatility. The Review of Economics and Statistics.
  • Andersen, Torben G. et al. (2001). The distribution of realized exchange rate volatility. Journal of the American Statistical Association.
  • Barndorff-Nielsen, Ole E. et al. (2006). Limit theorems for bipower variation in financial econometrics. Econometric Theory.
  • Barndorff-Nielsen, Ole E. et al. (2008). Designing realized kernels to measure the ex post variation of equity prices in the presence of noise. Econometrica.
    1

    Lyócsa appreciates the support from VEGA project, Slovakia ”Volatility density forecasts on financial markets” under Grant no. 1/0257/18.