Training Model Trees on Data Streams with Missing Values

Parisot, Olivier; Didry, Yoanne; Tamisier, Thomas; Otjacques, Benoît

doi:10.1007/978-3-319-30162-4_6

Part of the book series: Communications in Computer and Information Science (CCIS, volume 584)

Included in the following conference series:

International Conference on Data Management Technologies and Applications

Abstract

Model trees combine the interpretability of decision trees with the efficiency of multiple linear regressions, which makes them useful for performing predictive analysis dynamically on data streams. However, missing values in the data streams are an issue during the training phase of a model tree. In this article, we compare different approaches for dealing with incomplete streams and measure their impact on the accuracy of the resulting model tree. Moreover, we propose an online method to estimate and adjust the missing values during stream processing. To show the results, a prototype has been developed and tested on several benchmarks.
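
To make the idea of filling in missing values while a stream is being processed more concrete, here is a minimal sketch. It is not the method proposed in the paper, whose details are not given in this abstract. It assumes a simple running-mean imputation, where each missing value is replaced by the mean of the values seen so far for that attribute. The class name RunningMeanImputer, the use of None or NaN to mark missing values, and the fallback of 0.0 before any value has been observed are all illustrative assumptions.

```python
import math
from typing import List, Optional, Sequence


class RunningMeanImputer:
    """Illustrative only: replace a missing value with the mean of the
    values observed so far for the same attribute (hypothetical helper,
    not the authors' method)."""

    def __init__(self, n_attributes: int) -> None:
        self.counts: List[int] = [0] * n_attributes
        self.means: List[float] = [0.0] * n_attributes

    @staticmethod
    def _is_missing(value: Optional[float]) -> bool:
        # Assumption: missing values are encoded as None or NaN.
        return value is None or math.isnan(value)

    def process(self, row: Sequence[Optional[float]]) -> List[float]:
        """Return a completed copy of one stream instance and update
        the per-attribute running means with its observed values."""
        filled: List[float] = []
        for i, value in enumerate(row):
            if self._is_missing(value):
                # Arbitrary fallback of 0.0 if nothing has been seen yet.
                filled.append(self.means[i] if self.counts[i] else 0.0)
            else:
                # Incremental mean update: one pass, constant memory.
                self.counts[i] += 1
                self.means[i] += (value - self.means[i]) / self.counts[i]
                filled.append(float(value))
        return filled


# Three instances arrive one at a time; two contain a missing value.
stream = [[1.0, 10.0], [3.0, None], [None, 30.0]]
imputer = RunningMeanImputer(n_attributes=2)
completed = [imputer.process(row) for row in stream]
```

Each completed instance could then be passed to an incremental learner. The update uses only counts and means, so the imputer never stores past instances, which is the usual constraint in stream processing.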

Acknowledgements

The project is supported by a grant from the Ministry of Economy and External Trade, Grand-Duchy of Luxembourg, under the RDI Law. This work was carried out in partnership with infinAIt Solutions S.A. (http://infinait.eu); we thank Gero Vierke and Helmut Rieder for their help.

Author information

Authors and Affiliations

Luxembourg Institute of Science and Technology (LIST), Belvaux, Luxembourg
Olivier Parisot, Yoanne Didry, Thomas Tamisier & Benoît Otjacques

Corresponding author

Correspondence to Olivier Parisot.

Editor information

Editors and Affiliations

School of Computing, Dublin City University, Dublin, Ireland
Markus Helfert
Human-Computer Interaction, Medical University of Graz, Graz, Austria
Andreas Holzinger
Department of Informatics, University of Minho, Braga, Portugal
Orlando Belo
Dept. of Electronics and Information, Politecnico di Milano, Milan, Italy
Chiara Francalanci

Cite this paper

Parisot, O., Didry, Y., Tamisier, T., Otjacques, B. (2016). Training Model Trees on Data Streams with Missing Values. In: Helfert, M., Holzinger, A., Belo, O., Francalanci, C. (eds) Data Management Technologies and Applications. DATA 2015. Communications in Computer and Information Science, vol 584. Springer, Cham. https://doi.org/10.1007/978-3-319-30162-4_6

DOI: https://doi.org/10.1007/978-3-319-30162-4_6
Published: 20 February 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30161-7
Online ISBN: 978-3-319-30162-4
eBook Packages: Computer Science, Computer Science (R0)
