
Multiple-cause discovery combined with structure learning for high-dimensional discrete data and application to stock prediction

  • Methodologies and Application
Soft Computing

Abstract

Causal discovery in observational data is crucial to a variety of scientific and business research. Although many causal discovery algorithms have been proposed in recent decades, none of them deals effectively with high-dimensional discrete data. The main challenge is the complex interaction among a large number of variables, which leads to many spurious causal relations being discovered. In this work, we propose a novel multiple-cause discovery method combined with structure learning (McDSL) to eliminate these spurious causalities. The method is carried out in two phases. In the first phase, conditional independence tests are used to distinguish direct causal candidates from indirect ones. In the second phase, the causal direction of multi-cause structures is carefully determined with a hybrid causal discovery method. Validation experiments on synthetic data showed that McDSL is reliable in discovering multi-cause structures and eliminating indirect causes. We then applied the algorithm to discover multiple causes of stock return using 13 years of historical financial data from the Shanghai Stock Exchange of China, and established a stock prediction model. Experimental results showed that the causes discovered by McDSL reveal how the key risk factors of the stock market changed over the 13 years, indicating that investors should adjust their investment strategies over time. Moreover, the causes discovered by McDSL yield better stock-return prediction performance than those selected by other common filter-based feature selection algorithms.
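To make the two-phase procedure described above more concrete, the following is a minimal sketch of the first phase only, under assumptions of our own: discrete variables coded as integers, a chi-square conditional independence test stratified over the conditioning set, and conditioning sets restricted to a single other candidate. It illustrates the general idea of pruning indirect causes and is not the McDSL algorithm itself; the function names (`ci_test`, `prune_indirect_causes`), the sparse-stratum cutoff, and all thresholds are ours.

```python
# Illustrative sketch of phase-one-style pruning of indirect causes.
# NOT the authors' McDSL implementation; test choice and thresholds are assumptions.
import numpy as np
from scipy.stats import chi2, chi2_contingency


def _contingency(x, y):
    """Cross-tabulate two discrete 1-D integer arrays."""
    xs, ys = np.unique(x), np.unique(y)
    table = np.zeros((len(xs), len(ys)))
    for i, xv in enumerate(xs):
        for j, yv in enumerate(ys):
            table[i, j] = np.sum((x == xv) & (y == yv))
    return table


def ci_test(x, y, z=None, alpha=0.01):
    """Return True if 'X independent of Y given Z' is NOT rejected.
    x, y are discrete 1-D arrays; z is an optional 2-D array with one
    conditioning variable per column."""
    x, y = np.asarray(x), np.asarray(y)
    if z is None or np.size(z) == 0:
        table = _contingency(x, y)
        if min(table.shape) < 2:
            return True                      # degenerate table: cannot reject
        _, p, _, _ = chi2_contingency(table, correction=False)
        return p > alpha
    z = np.asarray(z)
    stat, dof = 0.0, 0
    for cfg in np.unique(z, axis=0):         # stratify on each Z configuration
        mask = np.all(z == cfg, axis=1)
        if mask.sum() < 10:                  # skip sparse strata (heuristic)
            continue
        table = _contingency(x[mask], y[mask])
        if min(table.shape) < 2:
            continue
        s, _, d, _ = chi2_contingency(table, correction=False)
        stat, dof = stat + s, dof + d
    return dof == 0 or chi2.sf(stat, dof) > alpha


def prune_indirect_causes(data, target, candidates, alpha=0.01):
    """Phase-1-style pruning: drop a candidate if it becomes independent of
    the target once we condition on some other single candidate.
    'data' maps variable names to discrete 1-D numpy arrays."""
    direct = []
    for c in candidates:
        spurious = any(
            ci_test(data[c], data[target], data[other][:, None], alpha)
            for other in candidates if other != c
        )
        if not spurious:
            direct.append(c)
    return direct
```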


Notes

  1. If \(|S| = |S'| = 1\), the above definition reduces to the definition given in Peters et al. (2011); a simplified sketch of this kind of pairwise test is given after these notes.

  2. ‘\(\#\) Factor’ indicates that Factor \(\#\) is inferred by McDSL as a cause of return on the training set.

  3. ‘NoFS’ indicates no feature selection. Best results are highlighted in bold. The value in parentheses indicates the performance difference relative to our algorithm. ‘Average’ is the average value of the six algorithms over the seven baseline models.
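Note 1 refers to the additive-noise-model framework of Peters et al. (2011) for discrete data, which is the kind of pairwise direction test the second phase builds on. The sketch below is a heavily simplified, self-contained version of that idea under our own assumptions (mode regression for \(f\), a plain chi-square test of residual independence, integer-coded variables); the published method searches over a richer class of functions and noise models, so treat this only as an illustration, not the authors' procedure.

```python
# Illustrative sketch of a pairwise direction test on discrete data in the
# spirit of additive noise models (Peters et al. 2011). Simplified; all
# modelling choices here are assumptions made for this example.
import numpy as np
from scipy.stats import chi2_contingency


def _mode_regression(x, y):
    """Map each value of x to the most frequent y observed with it."""
    f = {}
    for xv in np.unique(x):
        vals, counts = np.unique(y[x == xv], return_counts=True)
        f[xv] = vals[np.argmax(counts)]
    return f


def _independent(a, b, alpha=0.05):
    """Chi-square test of independence for two discrete 1-D arrays."""
    av, bv = np.unique(a), np.unique(b)
    table = np.array([[np.sum((a == i) & (b == j)) for j in bv] for i in av])
    if min(table.shape) < 2:
        return True                          # degenerate table: cannot reject
    _, p, _, _ = chi2_contingency(table, correction=False)
    return p > alpha


def anm_admits(x, y, alpha=0.05):
    """True if y = f(x) + noise, with noise independent of x, is not rejected.
    x and y are integer-coded discrete 1-D arrays."""
    f = _mode_regression(x, y)
    residual = y - np.array([f[v] for v in x])
    return _independent(x, residual, alpha)


def infer_direction(x, y):
    """'->' if only x->y admits an additive noise model, '<-' if only y->x
    does, otherwise 'undecided'."""
    forward, backward = anm_admits(x, y), anm_admits(y, x)
    if forward and not backward:
        return "->"
    if backward and not forward:
        return "<-"
    return "undecided"
```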

References

  • Agbabiaka TB, Savović J, Ernst E (2008) Methods for causality assessment of adverse drug reactions. Drug Saf 31(1):21–37

  • Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD (2010) Local causal and Markov blanket induction for causal discovery and feature selection for classification. Part I: Algorithms and empirical evaluation. J Mach Learn Res 11:171–234

  • Andreu L, Aldás J, Bigné JE, Mattila AS (2010) An analysis of e-business adoption and its impact on relational quality in travel agency–supplier relationships. Tour Manag 31(6):777–787

  • Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. CRC Press, Boca Raton

  • Cai R, Zhang Z, Hao Z (2011) BASSUM: a Bayesian semi-supervised method for classification feature selection. Pattern Recognit 44(4):811–820

  • Cai R, Zhang Z, Hao Z (2013a) Causal gene identification using combinatorial V-structure search. Neural Netw 43:63–71

  • Cai R, Zhang Z, Hao Z (2013b) SADA: a general framework to support robust causation discovery. In: Proceedings of the 30th international conference on machine learning, pp 208–216

  • Chang YC, Hsieh YL, Chen CC, Hsu WL (2015) A semantic frame-based intelligent agent for topic detection. Soft Comput. doi:10.1007/s00500-015-1695-4

  • De Morais SR, Aussem A (2010) A novel Markov boundary based feature subset selection algorithm. Neurocomputing 73(4):578–584

  • Esposito C, Ficco M, Palmieri F, Castiglione A (2015) Smart cloud storage service selection based on fuzzy logic, theory of evidence and game theory. IEEE Trans Comput. doi:10.1109/TC.2015.2389952

  • Fama EF, French KR (1992) The cross-section of expected stock returns. J Financ 47(2):427–465

  • Fernandez-Lozano C, Seoane JA, Gestal M, Gaunt TR, Dorado J, Campbell C (2015) Texture classification using feature selection and kernel-based techniques. Soft Comput. doi:10.1007/s00500-014-1573-5

  • Fu R, Qin B, Liu T (2015) Open-categorical text classification based on multi-LDA models. Soft Comput 19(1):29–38

  • Hoyer PO, Janzing D, Mooij JM, Peters J, Schölkopf B (2009) Nonlinear causal discovery with additive noise models. In: Advances in neural information processing systems, pp 689–696

  • Kano Y, Shimizu S (2003) Causal inference using nonnormality. In: Proceedings of the international symposium on science of modeling, the 30th anniversary of the information criterion, pp 261–270

  • Karahoca A, Tunga MA (2015) A polynomial based algorithm for detection of embolism. Soft Comput 19(1):167–177

  • Koller D, Sahami M (1996) Toward optimal feature selection. In: Proceedings of the 13th international conference on machine learning, pp 284–292

  • Lee M-C (2009) Using support vector machine with a hybrid feature selection method to the stock trend prediction. Expert Syst Appl 36(8):10896–10904

  • Mooij J, Janzing D, Peters J, Schölkopf B (2009) Regression by dependence minimization and its application to causal inference in additive noise models. In: Proceedings of the 26th annual international conference on machine learning, pp 745–752

  • Pearl J (2000) Causality: models, reasoning and inference, vol 29. Cambridge University Press, Cambridge

  • Peters J, Janzing D, Gretton A, Schölkopf B (2009) Detecting the direction of causal time series. In: Proceedings of the 26th annual international conference on machine learning, pp 801–808

  • Peters J, Janzing D, Schölkopf B (2010) Identifying cause and effect on discrete data using additive noise models. In: International conference on artificial intelligence and statistics, pp 597–604

  • Peters J, Janzing D, Schölkopf B (2011) Causal inference on discrete data using additive noise models. IEEE Trans Pattern Anal Mach Intell 33(12):2436–2450

  • Sethi R (1996) Endogenous regime switching in speculative markets. Struct Change Econ Dyn 7(1):99–118

  • Shimizu S, Hoyer PO, Hyvärinen A, Kerminen A (2006) A linear non-Gaussian acyclic model for causal discovery. J Mach Learn Res 7:2003–2030

  • Sobel ME (1996) An introduction to causal inference. Sociol Methods Res 24(3):353–379

  • Spirtes P, Glymour CN, Scheines R (2000) Causation, prediction, and search, vol 81. MIT Press, Cambridge

  • Tibshirani R (1994) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58(1):267–288

  • Tsai C-F, Hsiao Y-C (2010) Combining multiple feature selection methods for stock prediction: union, intersection, and multi-intersection approaches. Decis Support Syst 50(1):258–269

  • Tsai C-F, Lin Y-C, Yen DC, Chen Y-M (2011) Predicting stock returns by classifier ensembles. Appl Soft Comput 11(2):2452–2459

  • Tsamardinos I, Aliferis CF, Statnikov A (2003) Time and sample efficient discovery of Markov blankets and direct causal relations. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, pp 673–678

  • Zhang J, Spirtes P (2008) Detection of unfaithfulness and robust causal inference. Minds Mach 18(2):239–271

  • Zhang X, Hu Y, Xie K, Wang S, Ngai EWT, Liu M (2014) A causal feature selection algorithm for stock prediction modeling. Neurocomputing 142:48–59

  • Zhu Z, Ong Y-S, Dash M (2007) Markov blanket-embedded genetic algorithm for gene selection. Pattern Recognit 40(11):3236–3248

  • Zunino L, Zanin M, Tabak BM, Pérez DG, Rosso OA (2010) Complexity-entropy causality plane: a useful approach to quantify the stock market inefficiency. Phys A Stat Mech Appl 389(9):1891–1901

  • Zuo Y, Kita E (2012) Stock price forecast using Bayesian network. Expert Syst Appl 39(8):6729–6737


Acknowledgments

This research was partly supported by the National Natural Science Foundation of China (71271061, 70801020), Science and Technology Planning Project of Guangdong Province, China (2010B010600034, 2012B091100192), Guangdong Natural Science Foundation Research Team (S2013030015737), and Business Intelligence Key Team of Guangdong University of Foreign Studies (TD1202).

Author information

Corresponding author

Correspondence to Weiqi Chen.

Additional information

Communicated by V. Loia.


About this article


Cite this article

Chen, W., Hao, Z., Cai, R. et al. Multiple-cause discovery combined with structure learning for high-dimensional discrete data and application to stock prediction. Soft Comput 20, 4575–4588 (2016). https://doi.org/10.1007/s00500-015-1764-8

