Abstract
Before a machine learning algorithm can be applied, the data must be stored in memory, which may consume large amounts of it. Reducing the memory used to represent a dataset can also reduce the number of operations required to process it. Most libraries represent data in traditional structures (vectors or matrices, for example), which forces iteration over the whole dataset to obtain a result. In this paper we present a technique for processing categorical data that has previously been encoded in blocks of arbitrary size. The method processes the data block by block, which can reduce the number of iterations over the original dataset while achieving performance similar to traditional processing. The data is still stored in memory, but in an encoded form that optimizes both the memory consumed by the representation and the operations required to process it. The experiments carried out show slightly lower processing times than those obtained with traditional implementations.
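To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of what block-wise processing of encoded categorical data can look like: small integer category codes are bit-packed into 64-bit blocks, and a query such as a frequency count iterates over the blocks rather than over the original values. The block width, bits per value, and function names are illustrative assumptions.

```python
# Hedged sketch: illustrates bit-packing categorical codes into 64-bit blocks
# and answering a query block by block. Not the paper's actual implementation.

BITS_PER_VALUE = 2                      # assumption: a 4-category variable
VALUES_PER_BLOCK = 64 // BITS_PER_VALUE # 32 codes fit in one 64-bit block
MASK = (1 << BITS_PER_VALUE) - 1

def encode(values):
    """Pack small integer category codes into 64-bit integer blocks."""
    blocks = []
    for i in range(0, len(values), VALUES_PER_BLOCK):
        block = 0
        for j, v in enumerate(values[i:i + VALUES_PER_BLOCK]):
            block |= (v & MASK) << (j * BITS_PER_VALUE)
        blocks.append(block)
    return blocks

def count_category(blocks, n, category):
    """Count occurrences of `category` by iterating over blocks, not raw values."""
    count = 0
    remaining = n
    for block in blocks:
        k = min(VALUES_PER_BLOCK, remaining)
        for j in range(k):
            if (block >> (j * BITS_PER_VALUE)) & MASK == category:
                count += 1
        remaining -= k
    return count

data = [0, 1, 2, 3, 1, 1, 0, 2] * 10    # 80 categorical codes
blocks = encode(data)
print(len(blocks))                      # 3 blocks hold all 80 values
print(count_category(blocks, len(data), 1))  # 30
```

The memory saving comes from the encoding (2 bits per value instead of a machine word), and the loop structure shows how a result can be accumulated per block; a real implementation could additionally answer some queries with bitwise operations on whole blocks at once.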
Notes
- u64 and s64 are supported only on 64-bit systems.
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this paper
Salvador-Meneses, J., Ruiz-Chavez, Z., Garcia-Rodriguez, J. (2018). Categorical Big Data Processing. In: Yin, H., Camacho, D., Novais, P., Tallón-Ballesteros, A. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2018. IDEAL 2018. Lecture Notes in Computer Science(), vol 11314. Springer, Cham. https://doi.org/10.1007/978-3-030-03493-1_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03492-4
Online ISBN: 978-3-030-03493-1