Abstract
The growth of textual databases and their representation in high-dimensional spaces hinders the automated processing of these large collections and the extraction of knowledge from them. Text mining methods and techniques are used to address the challenges of high dimensionality. Among them, the term frequency-inverse document frequency (TF-IDF) weighting method is the most widely used approach for representing documents. Unfortunately, TF-IDF produces large descriptors (generally more than 1000 components), which require highly complex models. Text classification systems based on these models suffer from overfitting and are very slow. To overcome these problems, we use attribute selection methods; however, because these methods are deterministic, we risk losing a large amount of information. To recover from this loss, we propose a probabilistic vector representation of each document, based on the previously selected relevant terms. We then associate with each document a set of features composed of local and global probabilistic coefficients based on the selected terms. More precisely, the formulas for these components combine the frequency of each descriptor, the length of each document, and the size of the corpus. To evaluate this approach, we compare the TF-IDF representation with the new probabilistic representation on the classification of the BBCSPORT corpus. In the classification phase, we use several versions of Bayesian networks and the multilayer perceptron. The results are satisfactory: the multilayer perceptron neural network classifier achieves a recognition rate of 100% with the new representation, versus 94.69% with simple TF-IDF weighting.
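The abstract names the ingredients of the two representations but not their formulas. The sketch below is only an illustration under stated assumptions: `tf_idf` uses one common TF-IDF variant (relative term frequency times log(N/df)), and `probabilistic_features` is a hypothetical stand-in that builds a local coefficient (term count over document length) and a global coefficient (document frequency over corpus size) for each selected term. Neither function name nor the exact formulas come from the paper, and documents are assumed to be non-empty token lists.

```python
import math
from collections import Counter


def tf_idf(docs):
    """One {term: weight} dict per tokenized document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


def probabilistic_features(docs, selected_terms):
    """Hypothetical local/global coefficients per selected term (not the paper's formulas)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    features = []
    for doc in docs:
        counts = Counter(doc)
        row = []
        for t in selected_terms:
            local_p = counts[t] / len(doc)  # share of this document's tokens equal to t
            global_p = df[t] / n_docs       # share of corpus documents containing t
            row.extend([local_p, global_p])
        features.append(row)
    return features


# Toy corpus standing in for a real one such as BBCSPORT.
docs = [
    ["goal", "match", "goal", "team"],
    ["match", "race", "lap"],
    ["team", "goal", "win"],
]
weights = tf_idf(docs)
features = probabilistic_features(docs, ["goal", "match"])
```

With a fixed list of selected terms, each document maps to a vector of length twice the number of terms, regardless of vocabulary size, which is the dimensionality reduction the abstract is after.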
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Bounabi, M., El Moutaouakil, K., Satori, K. (2018). A Probabilistic Vector Representation and Neural Network for Text Classification. In: Tabii, Y., Lazaar, M., Al Achhab, M., Enneya, N. (eds) Big Data, Cloud and Applications. BDCA 2018. Communications in Computer and Information Science, vol 872. Springer, Cham. https://doi.org/10.1007/978-3-319-96292-4_27
DOI: https://doi.org/10.1007/978-3-319-96292-4_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-96291-7
Online ISBN: 978-3-319-96292-4
eBook Packages: Computer Science (R0)