Abstract
Data scientists today favor “easy” high-level languages such as R and Python, which accomplish complex mathematical tasks in a few lines of code but suffer from memory and speed limitations. Data summarization is a fundamental data mining technique that also holds promise for more demanding data science applications. Unfortunately, most summarization approaches require reading the entire data set before any machine learning (ML) model can be computed, and models are hard to maintain when data samples are added or removed. With these motivations in mind, we present incremental algorithms that smartly compute a summarization matrix, previously used in parallel DBMSs, to derive ML models incrementally in data science languages. In contrast to previous approaches, our smart algorithms interleave model computation periodically as the data set is being summarized. A salient feature is scalability to large data sets, provided the summarization matrix fits in RAM, a reasonable assumption in most cases. We show that our incremental approach works for a wide spectrum of ML models. Our experimental evaluation shows that the models become increasingly accurate, reaching full accuracy once the data set has been fully scanned. Moreover, our incremental algorithms are as fast as a popular Python ML library and much faster than R’s built-in routines.
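To make the idea concrete, the sketch below (our own minimal illustration, not the authors' implementation) shows how a summarization matrix of the kind the abstract describes can be maintained incrementally in Python with NumPy: each data chunk is folded into a single matrix Gamma = Z'Z, where Z augments the data with a column of ones and the target, and a linear regression model can be recovered from Gamma's blocks at any point during the scan. The function names update_gamma and lr_from_gamma, and the chunk sizes, are illustrative assumptions.

import numpy as np

def update_gamma(Gamma, X_chunk, y_chunk):
    # Fold one chunk into the summarization matrix: Gamma += Z'Z,
    # where each row of Z is [1, x_1, ..., x_d, y]. Gamma therefore
    # holds n, the linear sums, and all cross-product sums at once.
    n = X_chunk.shape[0]
    Z = np.hstack([np.ones((n, 1)), X_chunk, y_chunk.reshape(-1, 1)])
    return Gamma + Z.T @ Z

def lr_from_gamma(Gamma):
    # Recover linear regression from Gamma's blocks by solving the
    # normal equations ([1,X]'[1,X]) beta = [1,X]'y.
    A = Gamma[:-1, :-1]
    b = Gamma[:-1, -1]
    return np.linalg.solve(A, b)

# One pass over the data in chunks; the model can be recomputed from
# Gamma at any point during the scan, which is the incremental property
# the abstract describes.
d = 3
Gamma = np.zeros((d + 2, d + 2))
rng = np.random.default_rng(0)
for _ in range(10):  # 10 synthetic chunks of 100 rows each
    X = rng.normal(size=(100, d))
    y = X @ np.array([1.0, -2.0, 0.5]) + 1.0 + rng.normal(scale=0.1, size=100)
    Gamma = update_gamma(Gamma, X, y)
    beta = lr_from_gamma(Gamma)  # intermediate model, increasingly accurate
print(beta)  # approximately [1.0, 1.0, -2.0, 0.5]: intercept, then coefficients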
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Al-Amin, S.T., Ordonez, C. Incremental and accurate computation of machine learning models with smart data summarization. J Intell Inf Syst 59, 149–172 (2022). https://doi.org/10.1007/s10844-021-00690-5