Parallelization of the Poisson-Binomial Radius Distance for Comparing Histograms of n-grams

  • Conference paper
Distributed Computing and Artificial Intelligence, Volume 1: 18th International Conference (DCAI 2021)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 327)

Abstract

Text documents are typically represented as bags of words to facilitate subsequent steps in their analysis and classification. Such a representation tends to be high-dimensional and sparse because, for each document, a histogram of its n-grams must be built over a global, and therefore large, vocabulary that is common to the whole collection of texts under consideration. A straightforward and powerful way to further process the documents is to compute pairwise distances between their bag-of-words representations. A suitable distance for comparing histograms must be chosen; one example is the recently proposed Poisson-Binomial radius (PBR) distance, which has been shown to be very competitive in terms of accuracy but somewhat computationally costly compared with classic alternatives. We present a GPU-based parallelization of the PBR distance that alleviates the cost of comparing large histograms of n-grams. Our experiments were performed with publicly available datasets of n-grams and showed that speed-ups of between 12 and 17 times can be achieved with respect to the sequential implementation.
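
The representation described in the abstract can be illustrated with a minimal sketch: each document becomes a histogram of n-gram counts over one vocabulary shared by the whole collection, and documents are then compared pair by pair. This is not the authors' implementation. The function names, the whitespace tokenizer, and the toy documents are assumptions made for illustration, and because the PBR formula is not given on this page, a plain L1 distance is used as a stand-in.

```python
from collections import Counter
from itertools import combinations


def ngrams(text, n=2):
    """Return the word n-grams of `text` (assumes simple whitespace tokenization)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_histograms(docs, n=2):
    """Count n-grams per document against one vocabulary shared by all documents."""
    counts = [Counter(ngrams(doc, n)) for doc in docs]
    # Global vocabulary: every n-gram seen anywhere in the collection.
    vocabulary = sorted(set().union(*counts))
    # Each histogram has one bin per vocabulary entry, so most bins are zero.
    histograms = [[c[g] for g in vocabulary] for c in counts]
    return vocabulary, histograms


def l1_distance(x, y):
    """Stand-in histogram distance (NOT the PBR distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def pairwise_distances(histograms, distance=l1_distance):
    """Fill a symmetric matrix with the distance between every pair of histograms."""
    m = len(histograms)
    matrix = [[0.0] * m for _ in range(m)]
    for i, j in combinations(range(m), 2):
        matrix[i][j] = matrix[j][i] = distance(histograms[i], histograms[j])
    return matrix


# Hypothetical toy collection, for illustration only.
docs = [
    "the news covered the report",
    "the news covered the election",
    "a report on the election",
]
vocabulary, histograms = build_histograms(docs, n=2)
D = pairwise_distances(histograms)
```

Every entry of the matrix depends only on its own pair of histograms, so the pairs can be evaluated independently of one another. That independence is what makes a pairwise histogram comparison a natural candidate for parallel execution. Any other histogram distance can be passed through the `distance` parameter.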

Acknowledgments

The authors acknowledge support to attend DCAI’21 provided by Facultad de Administración and “Convocatoria nacional para el apoyo a la movilidad internacional 2019–2021”, Universidad Nacional de Colombia Sede - Manizales.

Author information

Corresponding author

Correspondence to Ana-Lorena Uribe-Hurtado.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Uribe-Hurtado, AL., Orozco-Alzate, M. (2022). Parallelization of the Poisson-Binomial Radius Distance for Comparing Histograms of n-grams. In: Matsui, K., Omatu, S., Yigitcanlar, T., González, S.R. (eds) Distributed Computing and Artificial Intelligence, Volume 1: 18th International Conference. DCAI 2021. Lecture Notes in Networks and Systems, vol 327. Springer, Cham. https://doi.org/10.1007/978-3-030-86261-9_2
