Abstract
Text documents are typically represented as bags of words to facilitate subsequent steps in their analysis and classification. Such a representation tends to be high-dimensional and sparse because, for each document, a histogram of its n-grams must be built over a global, and therefore large, vocabulary that is common to the whole collection of texts under consideration. A straightforward and powerful way to further process the documents is to compute pairwise distances between their bag-of-words representations. A proper distance for comparing histograms must be chosen, for instance the recently proposed Poisson-Binomial radius (PBR) distance, which has been shown to be very competitive in terms of accuracy but somewhat computationally costly compared with classic alternatives. We present a GPU-based parallelization of the PBR distance to alleviate the cost of comparing large histograms of n-grams. Our experiments were performed with publicly available datasets of n-grams and showed that speed-ups between 12 and 17 times can be achieved with respect to the sequential implementation.
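The pipeline described in the abstract (n-gram histograms over a shared vocabulary, followed by pairwise distances) can be illustrated with a minimal sequential sketch in Python. This is not the authors' implementation: the function names are hypothetical, the tokenization is a simple whitespace split, and the L1 (city-block) distance is used as a stand-in because the PBR formula is not given in this text. Any histogram distance with the same two-argument signature could be passed in its place.

```python
from collections import Counter


def ngrams(text, n=2):
    """Return the word n-grams of a text (assumes whitespace tokenization)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_histograms(docs, n=2):
    """Build one normalized n-gram histogram per document over a global vocabulary."""
    counts = [Counter(ngrams(d, n)) for d in docs]
    vocab = sorted(set().union(*counts))  # vocabulary shared by the whole collection
    hists = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid division by zero for very short texts
        hists.append([c[g] / total for g in vocab])
    return vocab, hists


def l1_distance(h1, h2):
    """Stand-in histogram distance (city-block); NOT the PBR distance."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def pairwise_distances(hists, dist=l1_distance):
    """Fill a symmetric distance matrix; each entry depends only on its two histograms."""
    m = len(hists)
    D = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            D[i][j] = D[j][i] = dist(hists[i], hists[j])
    return D


docs = ["the cat sat", "the cat ran", "a dog barked loudly"]
vocab, hists = build_histograms(docs, n=2)
D = pairwise_distances(hists)
```

Even in this toy case most histogram bins are zero, which reflects the sparsity mentioned above. The work inside each distance evaluation and across the pairs is independent, which is the kind of structure that makes such comparisons candidates for parallel execution.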
Acknowledgments
The authors acknowledge support to attend DCAI’21 provided by Facultad de Administración and “Convocatoria nacional para el apoyo a la movilidad internacional 2019–2021”, Universidad Nacional de Colombia Sede - Manizales.
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Uribe-Hurtado, AL., Orozco-Alzate, M. (2022). Parallelization of the Poisson-Binomial Radius Distance for Comparing Histograms of n-grams. In: Matsui, K., Omatu, S., Yigitcanlar, T., González, S.R. (eds) Distributed Computing and Artificial Intelligence, Volume 1: 18th International Conference. DCAI 2021. Lecture Notes in Networks and Systems, vol 327. Springer, Cham. https://doi.org/10.1007/978-3-030-86261-9_2
DOI: https://doi.org/10.1007/978-3-030-86261-9_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86260-2
Online ISBN: 978-3-030-86261-9
eBook Packages: Intelligent Technologies and Robotics (R0)