
A Probabilistic Vector Representation and Neural Network for Text Classification

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 872)

Abstract

The growth of textual databases and their representation in high-dimensional spaces hinder the automated processing of these large collections and the extraction of knowledge; addressing this high dimensionality is one of the central challenges of text-mining methods and techniques. The term frequency-inverse document frequency (TF-IDF) weighting scheme is the most widely used approach for representing documents. Unfortunately, TF-IDF produces descriptors of large size (generally more than 1000 components), which calls for models of great complexity; text classification systems based on such models suffer from overfitting and are very slow. To overcome these problems, we apply attribute selection methods, but given their deterministic nature we risk losing a great deal of information. To recover from this loss, we propose a probabilistic vector representation of each document based on the previously selected relevant terms: each document is assigned a set of features composed of local and global probabilistic coefficients computed from those terms. More precisely, the component formulas combine the frequency of each descriptor, the length of each document, and the size of the corpus. To demonstrate the performance of this treatment, we present a comparative study between the TF-IDF representation and the new probabilistic representation on the classification of the BBCSPORT corpus. In the classification phase, we use several versions of the Bayesian network and the multilayer perceptron. The results are satisfactory: the multilayer perceptron neural network classifier achieves a 100% recognition rate with the new representation, against 94.69% with the plain TF-IDF weighting.
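
The abstract describes, but does not reproduce, the component formulas of the probabilistic representation. The following minimal sketch is an illustration only: it assumes a local coefficient of tf/|doc| and a global coefficient of df/|corpus| for each pre-selected term, and contrasts the resulting vector with plain TF-IDF on a toy corpus. The function names, the toy documents, and the exact normalizations are assumptions made here for illustration, not the authors' formulas.

    import math
    from collections import Counter

    def tfidf_vector(doc, corpus, vocab):
        """Classic TF-IDF weights over the full vocabulary."""
        counts = Counter(doc)
        n_docs = len(corpus)
        vec = []
        for term in vocab:
            df = sum(1 for d in corpus if term in d)
            idf = math.log(n_docs / df) if df else 0.0
            vec.append(counts[term] * idf)
        return vec

    def probabilistic_vector(doc, corpus, selected_terms):
        """Sketch of a probabilistic representation: for each selected term,
        a local coefficient tf/|doc| and a global coefficient df/|corpus|.
        The paper's actual component formulas (combining term frequency,
        document length, and corpus size) are not reproduced here."""
        counts = Counter(doc)
        n_docs = len(corpus)
        vec = []
        for term in selected_terms:
            local = counts[term] / len(doc) if doc else 0.0
            df = sum(1 for d in corpus if term in d)
            vec.append((local, df / n_docs))
        return vec

    # Toy tokenized corpus; "goal" and "team" stand in for terms that
    # survived the attribute selection step.
    corpus = [["goal", "match", "team"],
              ["team", "score", "goal", "goal"],
              ["race", "lap", "driver"]]
    vocab = sorted({t for d in corpus for t in d})
    selected = ["goal", "team"]

    print(tfidf_vector(corpus[1], corpus, vocab))
    print(probabilistic_vector(corpus[1], corpus, selected))

Note that in the probabilistic variant the descriptor length is fixed by the selected-term list rather than by the full vocabulary, which is consistent with the abstract's motivation of escaping TF-IDF's large descriptor sizes.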



Author information

Correspondence to Mariem Bounabi.

Appendix

See Tables 7 and 8 and Fig. 5.

Table 7. Results for different network-structure learning algorithms using the new probabilistic method
Table 8. Results for different numbers of hidden nodes
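
Table 8 varies the number of hidden nodes of the multilayer perceptron. As an illustration of how such a sweep could be run, the sketch below assumes scikit-learn's MLPClassifier and a synthetic dataset standing in for the BBCSPORT feature vectors; the layer sizes and all other parameters are assumptions, not the authors' experimental setup.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # Synthetic stand-in for the document-feature matrix (not BBCSPORT).
    X, y = make_classification(n_samples=500, n_features=50, n_classes=5,
                               n_informative=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Sweep over hidden-layer sizes, loosely mirroring the Table 8 experiment.
    for n_hidden in (10, 25, 50, 100):
        clf = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=500,
                            random_state=0).fit(X_tr, y_tr)
        print(n_hidden, round(clf.score(X_te, y_te), 4))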


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Bounabi, M., El Moutaouakil, K., Satori, K. (2018). A Probabilistic Vector Representation and Neural Network for Text Classification. In: Tabii, Y., Lazaar, M., Al Achhab, M., Enneya, N. (eds) Big Data, Cloud and Applications. BDCA 2018. Communications in Computer and Information Science, vol 872. Springer, Cham. https://doi.org/10.1007/978-3-319-96292-4_27


  • DOI: https://doi.org/10.1007/978-3-319-96292-4_27


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-96291-7

  • Online ISBN: 978-3-319-96292-4

  • eBook Packages: Computer Science, Computer Science (R0)
