Abstract
Before a machine learning algorithm can be applied, the data must be stored in memory, which may consume large amounts of it. Reducing the memory used to represent a dataset can also reduce the number of operations required to process it. Most libraries represent data in traditional structures (vectors or matrices, for example), which forces iteration over the whole dataset to obtain a result. In this paper we present a technique for processing categorical data that has previously been encoded in blocks of arbitrary size. The method processes the data block by block, which can reduce the number of iterations over the original dataset while achieving performance similar to traditional processing. The data is still stored in memory, but in an encoded form that optimizes both the memory consumed by the representation and the operations required to process it. The experiments carried out show slightly lower processing times than those obtained with traditional implementations.
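To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of what block-wise processing of encoded categorical data can look like: small integer category codes are bit-packed into 64-bit blocks, and a query such as a frequency count iterates over the blocks rather than over the original values. The block width, bits per value, and function names are illustrative assumptions.

```python
# Hedged sketch: illustrates bit-packing categorical codes into 64-bit blocks
# and answering a query block by block. Not the paper's actual implementation.

BITS_PER_VALUE = 2                      # assumption: a 4-category variable
VALUES_PER_BLOCK = 64 // BITS_PER_VALUE # 32 codes fit in one 64-bit block
MASK = (1 << BITS_PER_VALUE) - 1

def encode(values):
    """Pack small integer category codes into 64-bit integer blocks."""
    blocks = []
    for i in range(0, len(values), VALUES_PER_BLOCK):
        block = 0
        for j, v in enumerate(values[i:i + VALUES_PER_BLOCK]):
            block |= (v & MASK) << (j * BITS_PER_VALUE)
        blocks.append(block)
    return blocks

def count_category(blocks, n, category):
    """Count occurrences of `category` by iterating over blocks, not raw values."""
    count = 0
    remaining = n
    for block in blocks:
        k = min(VALUES_PER_BLOCK, remaining)
        for j in range(k):
            if (block >> (j * BITS_PER_VALUE)) & MASK == category:
                count += 1
        remaining -= k
    return count

data = [0, 1, 2, 3, 1, 1, 0, 2] * 10    # 80 categorical codes
blocks = encode(data)
print(len(blocks))                      # 3 blocks hold all 80 values
print(count_category(blocks, len(data), 1))  # 30
```

The memory saving comes from the encoding (2 bits per value instead of a machine word), and the loop structure shows how a result can be accumulated per block; a real implementation could additionally answer some queries with bitwise operations on whole blocks at once.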
Notes
- u64 and s64 are supported only on 64-bit systems.
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this paper
Salvador-Meneses, J., Ruiz-Chavez, Z., Garcia-Rodriguez, J. (2018). Categorical Big Data Processing. In: Yin, H., Camacho, D., Novais, P., Tallón-Ballesteros, A. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2018. IDEAL 2018. Lecture Notes in Computer Science(), vol 11314. Springer, Cham. https://doi.org/10.1007/978-3-030-03493-1_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03492-4
Online ISBN: 978-3-030-03493-1