
Sparse Document Analysis Using Beta-Liouville Naive Bayes with Vocabulary Knowledge

  • Conference paper
  • First Online:
Document Analysis and Recognition – ICDAR 2021 (ICDAR 2021)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12822)

Included in the following conference series: ICDAR: International Conference on Document Analysis and Recognition

Abstract

Smoothing the parameters of multinomial distributions is an important concern in statistical inference tasks. In this paper, we present a new smoothing prior for the Multinomial Naive Bayes classifier. Our approach takes advantage of the Beta-Liouville distribution to estimate the multinomial parameters. To deal with sparse documents, we exploit vocabulary knowledge to define two distinct priors over the “observed” and the “unseen” words. We address the problem of large-scale, sparse data by enhancing the Multinomial Naive Bayes classifier, smoothing the word estimates with a Beta scale. Our approach is evaluated on two challenging applications involving sparse, large-scale documents: emotion intensity analysis and hate speech detection. Experiments on real-world datasets show the effectiveness of the proposed classifier compared to related methods.
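
As a rough illustration of the idea summarized above, the following minimal Python sketch shows one way a Beta-Liouville-style prior can smooth a Multinomial Naive Bayes classifier: Dirichlet-type pseudo-counts smooth the words observed in each class, while a Beta prior decides how much probability mass is reserved for unseen vocabulary words. The class name, the hyper-parameters (dir_alpha, beta_a, beta_b), and the posterior-mean update are illustrative assumptions, not the estimator derived in the paper.

```python
import numpy as np
from collections import Counter

class BetaLiouvilleNB:
    """Sketch of a Multinomial Naive Bayes classifier with a
    Beta-Liouville-style smoothing prior (illustrative only, not the
    paper's exact estimator)."""

    def __init__(self, dir_alpha=1.0, beta_a=10.0, beta_b=1.0):
        self.dir_alpha = dir_alpha  # Dirichlet-type pseudo-count for observed words
        self.beta_a = beta_a        # Beta prior weight toward observed words
        self.beta_b = beta_b        # Beta prior weight reserved for unseen words

    def fit(self, docs, labels, vocab):
        self.vocab = set(vocab)
        self.classes = sorted(set(labels))
        self.log_prior, self.log_pw = {}, {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            counts = Counter(w for d in class_docs for w in d if w in self.vocab)
            n_c = sum(counts.values())
            observed = set(counts)
            unseen = self.vocab - observed
            # Beta posterior mean: share of probability mass kept for the
            # words actually observed in class c.
            lam = 1.0 if not unseen else (self.beta_a + n_c) / (self.beta_a + self.beta_b + n_c)
            denom = n_c + self.dir_alpha * len(observed)
            # Dirichlet-smoothed estimates inside the observed vocabulary,
            # rescaled by lam ...
            logp = {w: np.log(lam * (counts[w] + self.dir_alpha) / denom) for w in observed}
            # ... and the remaining (1 - lam) mass spread uniformly over unseen words.
            if unseen:
                logp.update({w: np.log((1.0 - lam) / len(unseen)) for w in unseen})
            self.log_pw[c] = logp
            self.log_prior[c] = np.log(len(class_docs) / len(docs))
        return self

    def predict(self, doc):
        scores = {c: self.log_prior[c]
                     + sum(self.log_pw[c][w] for w in doc if w in self.vocab)
                  for c in self.classes}
        return max(scores, key=scores.get)
```

The observed/unseen split mirrors the construction of the Beta-Liouville distribution, where a Beta-distributed scale multiplies a Dirichlet-distributed block of proportions; here that scale separates the “observed” part of the vocabulary from the “unseen” one, in the spirit of the two priors described in the abstract.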


Notes

  1. http://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.


Author information

Corresponding author

Correspondence to Fatma Najar.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Najar, F., Bouguila, N. (2021). Sparse Document Analysis Using Beta-Liouville Naive Bayes with Vocabulary Knowledge. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) Document Analysis and Recognition – ICDAR 2021. Lecture Notes in Computer Science, vol. 12822. Springer, Cham. https://doi.org/10.1007/978-3-030-86331-9_23

  • DOI: https://doi.org/10.1007/978-3-030-86331-9_23

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86330-2

  • Online ISBN: 978-3-030-86331-9

  • eBook Packages: Computer Science, Computer Science (R0)
