Skip to main content
Log in

Incremental and accurate computation of machine learning models with smart data summarization

  • Published:
Journal of Intelligent Information Systems Aims and scope Submit manuscript

Abstract

Nowadays, data scientists prefer “easy” high-level languages like R and Python, which accomplish complex mathematical tasks with a few lines of code, but they present memory and speed limitations. Data summarization has been a fundamental technique in data mining that has promise with more demanding data science applications. Unfortunately, most summarization approaches require reading the entire data set before computing any machine learning (ML) model, the old-fashioned way. Also, it is hard to learn models if there is an addition or removal of data samples. Keeping these motivations in mind, we present incremental algorithms to smartly compute summarization matrix, previously used in parallel DBMSs, to compute ML models incrementally in data science languages. Compared to the previous approaches, our new smart algorithms interleave model computation periodically, as the data set is being summarized. A salient feature is scalability to large data sets, provided the summarization matrix fits in RAM, a reasonable assumption in most cases. We show our incremental approach is intelligent and works for a wide spectrum of ML models. Our experimental evaluation shows models get increasingly accurate, reaching total accuracy when the data set is fully scanned. On the other hand, we show our incremental algorithms are as fast as Python ML library, and much faster than R built-in routines.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

References

  • Ahmed, M. (2019). Data summarization: A survey. Knowledge and Information Systems, 58(2), 249–273.

    Article  Google Scholar 

  • Al-Amin, S.T., Chebolu, S.U.S., & Ordonez, C. (2020). Extending the R language with a scalable matrix summarization operator. In IEEE International conference on big data, big data 2020 (pp. 399–405).

  • Altiparmak, F., Tuncel, E., & Ferhatosmanoglu, H. (2008). Incremental maintenance of online summaries over multiple streams. IEEE Transactions on Knowledge and Data Engineering, 20(2), 216–229.

    Article  Google Scholar 

  • Beazley, D.M. (1996). SWIG: An easy to use tool for integrating scripting languages with C and C++. In Fourth annual USENIX tcl/tk workshop.

  • Bradley, P., Fayyad, U., & Reina, C. (1998). Scaling clustering algorithms to large databases. In Proc. ACM KDD conference (pp. 9–15).

  • Cauwenberghs, G., & Poggio, T.A. (2000). Incremental and decremental support vector machine learning. In Advances in neural information processing systems 13, papers from neural information processing systems (NIPS) 2000 (pp. 409–415). Denver: MIT Press.

  • Chebolu, S.U.S., Ordonez, C., & Al-Amin, S.T. (2019). Scalable machine learning in the R language using a summarization matrix. In Database and expert systems applications DEXA (pp. 247–262).

  • Chen, Y., Xiong, J., Xu, W., & Zuo, J. (2019). A novel online incremental and decremental learning algorithm based on variable support vector machine. Cluster Computing, 22(Supplement), 7435–7445.

    Article  Google Scholar 

  • Das, S., Sismanis, Y., Beyer, K., Gemulla, R., Haas, P., & McPherson, J. (2010). RICARDO: Integrating R and hadoop. In Proc. ACM SIGMOD conference (pp. 987–998).

  • David, J., Pessemier, T.D., Dekoninck, L., Coensel, B.D., Joseph, W., Botteldooren, D., & Martens, L. (2020). Detection of road pavement quality using statistical clustering methods. Journal of Intelligent Information System, 54, 483–499.

    Article  Google Scholar 

  • Dua, D., & Graff, C. (2017). UCI machine learning repository. http://archive.ics.uci.edu/ml.

  • Eddelbuettel, D. (2013). Seamless r and c++ integration with rcpp. New York: Springer.

    Book  Google Scholar 

  • Gepperth, A., & Hammer, B. (2016). Incremental learning algorithms and applications. In 24Th european symposium on artificial neural networks, ESANN.

  • Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning, 1st edn. New York: Springer.

    Book  Google Scholar 

  • He, H., Chen, S., Li, K., & Xu, X. (2011). Incremental learning from stream data. IEEE Transactions on Neural Networks, 22(12), 1901–1914.

    Article  Google Scholar 

  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning Vol. 112. Berlin: Springer.

    Book  Google Scholar 

  • Karasuyama, M., & Takeuchi, I. (2009). Multiple incremental decremental learning of support vector machines. In 23Rd annual conference on neural information processing systems 2009 (pp. 907–915).

  • LeCun, Y., Bengio, Y., & Hinton, G.E. (2015). Deep learning. Nature, 521(7553), 436–444.

    Article  Google Scholar 

  • Levatic, J., Ceci, M., Kocev, D., & Dzeroski, S. (2017). Semi-supervised classification trees. Journal of Intelligent Information System, 49, 461–486.

    Article  Google Scholar 

  • Ordonez, C., Zhang, Y., & Cabrera, W. (2016). The Gamma matrix to summarize dense and sparse data sets for big data analytics. IEEE Transactions on Knowledge and Data Engineering (TKDE), 28(7), 1906–1918.

    Article  Google Scholar 

  • Ordonez, C., Zhang, Y., & Johnsson, S.L. (2019). Scalable machine learning computing a data summarization matrix with a parallel array DBMS. Distributed and Parallel Databases, 37(3), 329–350.

    Article  Google Scholar 

  • Osojnik, A., Panov, P., & Dzeroski, S. (2018). Tree-based methods for online multi-target regression. Journal of Intelligent Information System, 50, 315–339.

    Article  Google Scholar 

  • Patra, B.K., & Nandi, S. (2015). Effective data summarization for hierarchical clustering in large datasets. Knowledge and Information Systems, 42(1), 1–20.

    Article  Google Scholar 

  • Pedregosa, F., Varoquaux, G., Gramfort, A., & Michel, V. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

    MathSciNet  MATH  Google Scholar 

  • Polikar, R., Upda, L., Upda, S.S., & Honavar, V.G. (2001). Learn++: an incremental learning algorithm for supervised neural networks. IEEE Trans. Syst. Man Cybern. Part C, 31(4), 497–508.

    Article  Google Scholar 

  • Ross, D.A., Lim, J., Lin, R., & Yang, M. (2008). Incremental learning for robust visual tracking. International Journal of Computer Vision, 77, 125–141.

    Article  Google Scholar 

  • Rumsey, D. (2011). Statistics For Dummies. –For dummies. Hoboken: Wiley.

    Google Scholar 

  • Spokoiny, A., & Shahar, Y. (2008). Incremental application of knowledge to continuously arriving time-oriented data. Journal of Intelligent Information System 1–33.

  • Tari, L., Tu, P.H., Hakenberg, J., Chen, Y., Son, T.C., Gonzalez, G., & Baral, C. (2012). Incremental information extraction using relational databases. IEEE Transactions on Knowledge and Data Engineering, 24(1), 86–99.

    Article  Google Scholar 

  • Totad, S.G., Geeta, R.B., & Reddy, P.V.G.D.P. (2012). Batch incremental processing for fp-tree construction using fp-growth algorithm. Knowledge and Information Systems, 33(2), 475–490.

    Article  Google Scholar 

  • Zakai, A. (2011). Emscripten: an llvm-to-javascript compiler. In C.V. Lopes K. Fisher (Eds.) Companion to the 26th annual ACM SIGPLAN conference on object-oriented programming, systems, languages, and applications, OOPSLA (pp. 301–312). ACM.

  • Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. In Proc. ACM SIGMOD conference (pp. 103–114).

  • Zhang, Y., Zhang, W., & Yang, J. (2010). I/o-efficient statistical computing with RIOT. In Proc. ICDE.

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Sikder Tahsin Al-Amin.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Al-Amin, S.T., Ordonez, C. Incremental and accurate computation of machine learning models with smart data summarization. J Intell Inf Syst 59, 149–172 (2022). https://doi.org/10.1007/s10844-021-00690-5

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10844-021-00690-5

Keywords

Navigation