Abstract
Big data analytics generally relies on parallel processing in large computer clusters. However, this approach is not always the best: CPU speeds and RAM capacities keep growing, making small computers faster and more attractive to the analyst. Machine Learning (ML) models are generally computed on a data set obtained by aggregating, transforming, and filtering big data, and that data set is orders of magnitude smaller than the raw data. Users prefer “easy” high-level languages like R and Python, which accomplish complex analytic tasks with a few lines of code, but which suffer memory and speed limitations. Finally, data summarization has been a fundamental technique in data mining that holds great promise for big data. With that motivation in mind, we adapt the \(\varGamma \) (Gamma) summarization matrix, previously used in parallel DBMSs, to work in the R language. \(\varGamma \) is significantly smaller than the data set, yet it captures fundamental statistical properties. \(\varGamma \) works well for a remarkably wide spectrum of ML models, including supervised and unsupervised models, whether dimensions (variables) are assumed dependent or independent. An extensive experimental evaluation shows that models computed on summarized data sets are accurate and that their computation is significantly faster than R built-in functions. Moreover, experiments show our R solution is faster and less resource-hungry than competing parallel systems, including a parallel DBMS and Spark.
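The paper's implementation is in R; as a language-agnostic illustration of the summarization idea the abstract describes, the hedged Python/NumPy sketch below follows the construction in Ordonez et al. (TKDE 2016): augment each data point with a constant 1 (and, for regression, the target value), so that a single matrix product \(\varGamma = Z^{T} Z\) collects the count \(n\), the linear sum \(L\), and the quadratic sum \(Q\) in one pass over the data, after which linear regression coefficients can be derived from \(\varGamma\) alone without revisiting the data. The variable names and the synthetic data set here are hypothetical, not taken from the paper.

```python
import numpy as np

# Hypothetical synthetic data set: n points, d dimensions, linear target with noise.
rng = np.random.default_rng(0)
n, d = 1000, 3
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, -1.0, 0.5]) + 1.0 + rng.normal(scale=0.1, size=n)

# Augmented matrix Z: each row is [1, x_1, ..., x_d, y].
Z = np.column_stack([np.ones(n), X, y])

# Gamma summarization matrix: (d+2) x (d+2), computed in one pass.
# Its blocks hold n, L = sum of x_i, Q = sum of x_i x_i^T, and cross terms with y.
Gamma = Z.T @ Z

n_from_gamma = Gamma[0, 0]          # count n sits in the top-left corner

# Linear regression recovered from Gamma alone (intercept included):
# the leading (d+1) x (d+1) block is X'X with an intercept column,
# and the last column of that block row range is X'y.
XtX = Gamma[: d + 1, : d + 1]
Xty = Gamma[: d + 1, d + 1]
beta = np.linalg.solve(XtX, Xty)    # least-squares coefficients
```

Because \(\varGamma\) is only \((d+2)\times(d+2)\) regardless of \(n\), it can be accumulated incrementally over data chunks or streams, which is what makes the summarized computation so much faster than operating on the full data set.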
Acknowledgements
The second author would like to thank Simon Urbanek, from AT&T Labs, for his guidance in understanding the R language runtime source code.
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Chebolu, S.U.S., Ordonez, C., Al-Amin, S.T. (2019). Scalable Machine Learning in the R Language Using a Summarization Matrix. In: Hartmann, S., Küng, J., Chakravarthy, S., Anderst-Kotsis, G., Tjoa, A., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2019. Lecture Notes in Computer Science, vol 11707. Springer, Cham. https://doi.org/10.1007/978-3-030-27618-8_19