Training Model Trees on Data Streams with Missing Values

  • Conference paper
Data Management Technologies and Applications (DATA 2015)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 584)

Abstract

Model trees combine the interpretability of decision trees with the efficiency of multiple linear regression, which makes them useful for predictive analysis on data streams. However, missing values in a data stream are a problem during the training phase of a model tree. In this article, we compare different approaches for dealing with incomplete streams and measure their impact on the accuracy of the resulting model tree. Moreover, we propose an online method to estimate and adjust the missing values during stream processing. To show the results, a prototype has been developed and tested on several benchmarks.
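To give a rough idea of what estimating missing values during stream processing can look like, here is a minimal Python sketch. It is not the method proposed in the paper: it uses a simple running-mean imputation as a stand-in. The class name `RunningMeanImputer`, its `process` method, and the choice of 0.0 as a fallback when no value has been observed yet are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence


class RunningMeanImputer:
    """Hypothetical helper: fills missing values (None or NaN) with the
    running mean of each attribute, updated one instance at a time."""

    def __init__(self, n_attributes: int) -> None:
        self.counts = [0] * n_attributes    # observed (non-missing) values per attribute
        self.means = [0.0] * n_attributes   # running mean per attribute

    @staticmethod
    def _is_missing(value: Optional[float]) -> bool:
        return value is None or (isinstance(value, float) and math.isnan(value))

    def process(self, instance: Sequence[Optional[float]]) -> List[float]:
        """Return a completed copy of `instance` and update the statistics."""
        completed = []
        for j, value in enumerate(instance):
            if self._is_missing(value):
                # Assumption: fall back to 0.0 if nothing has been observed yet.
                completed.append(self.means[j] if self.counts[j] else 0.0)
            else:
                # Incremental mean update; no past instances are stored.
                self.counts[j] += 1
                self.means[j] += (value - self.means[j]) / self.counts[j]
                completed.append(float(value))
        return completed


# Usage: feed instances one by one, as they arrive from the stream.
imputer = RunningMeanImputer(n_attributes=2)
stream = [[1.0, 10.0], [3.0, None], [None, 20.0]]
completed_stream = [imputer.process(x) for x in stream]
# completed_stream == [[1.0, 10.0], [3.0, 10.0], [2.0, 20.0]]
```

Each completed instance could then be passed to an incremental learner. Because the imputer keeps only a count and a mean per attribute, its memory use stays constant regardless of stream length.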

Notes

  1. http://www.cse.fau.edu/xqzhu/stream.html

  2. http://www.dcc.fc.up.pt/ltorgo/

Acknowledgements

The project is supported by a grant from the Ministry of Economy and External Trade, Grand-Duchy of Luxembourg, under the RDI Law. This work was carried out in partnership with infinAIt Solutions S.A. (http://infinait.eu); we would like to thank Gero Vierke and Helmut Rieder for their help.

Author information

Corresponding author

Correspondence to Olivier Parisot.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Parisot, O., Didry, Y., Tamisier, T., Otjacques, B. (2016). Training Model Trees on Data Streams with Missing Values. In: Helfert, M., Holzinger, A., Belo, O., Francalanci, C. (eds) Data Management Technologies and Applications. DATA 2015. Communications in Computer and Information Science, vol 584. Springer, Cham. https://doi.org/10.1007/978-3-319-30162-4_6

  • DOI: https://doi.org/10.1007/978-3-319-30162-4_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-30161-7

  • Online ISBN: 978-3-319-30162-4

  • eBook Packages: Computer Science (R0)
