Abstract
Feature reduction methods have been applied successfully to text categorization. In this paper, we present a comparative study of three feature reduction methods for text categorization: Document Frequency (DF), Term Frequency Inverse Document Frequency (TFIDF), and Latent Semantic Analysis (LSA). Our feature set is relatively large, since the text files contain thousands of distinct terms. We propose using these feature reduction methods as a preprocessing step for a Back-Propagation Neural Network (BPNN), to reduce the size of the input data during training. Experimental results on an Arabic data set show that, of the three dimensionality reduction techniques compared, TFIDF was the most effective at reducing the dimensionality of the feature space.
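To make the idea of term-level feature reduction concrete, the following is a minimal sketch of DF-based and TFIDF-based term selection, not the authors' implementation. The function names, the `min_df` and `top_k` parameters, the choice to rank terms by their TF-IDF score summed over all documents, and the toy English tokens are all assumptions made for illustration; the abstract does not specify these details. LSA and the BPNN classifier are not shown.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts a term at most once
    return df


def select_by_df(docs, min_df=2):
    """Keep terms that occur in at least `min_df` documents (assumed threshold)."""
    df = document_frequency(docs)
    return sorted(term for term, n in df.items() if n >= min_df)


def select_by_tfidf(docs, top_k=3):
    """Keep the `top_k` terms with the highest TF-IDF score summed over documents.

    Summing per-document scores is one possible ranking criterion, assumed here.
    """
    df = document_frequency(docs)
    n_docs = len(docs)
    scores = Counter()
    for tokens in docs:
        for term, count in Counter(tokens).items():
            scores[term] += count * math.log(n_docs / df[term])
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


def to_vector(tokens, vocabulary):
    """Map a tokenized document to raw term counts over the reduced vocabulary."""
    tf = Counter(tokens)
    return [tf[term] for term in vocabulary]


# Toy tokenized corpus (hypothetical placeholder data, not the paper's data set).
docs = [
    ["prayer", "fasting", "prayer", "the"],
    ["fasting", "charity", "the"],
    ["prayer", "pilgrimage", "the"],
]

df_vocab = select_by_df(docs, min_df=2)
tfidf_vocab = select_by_tfidf(docs, top_k=3)
reduced_inputs = [to_vector(tokens, tfidf_vocab) for tokens in docs]
```

In this sketch the reduced vectors in `reduced_inputs` stand in for the lower-dimensional inputs that would be passed to a classifier. Note that a term occurring in every document (here `"the"`) gets a TF-IDF score of zero and is dropped by the TFIDF criterion, whereas the DF threshold keeps it.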
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Harrag, F., El-Qawasmeh, E., Al-Salman, A.M.S. (2010). A Comparative Study of Statistical Feature Reduction Methods for Arabic Text Categorization. In: Zavoral, F., Yaghob, J., Pichappan, P., El-Qawasmeh, E. (eds) Networked Digital Technologies. NDT 2010. Communications in Computer and Information Science, vol 88. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14306-9_67
DOI: https://doi.org/10.1007/978-3-642-14306-9_67
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14305-2
Online ISBN: 978-3-642-14306-9
eBook Packages: Computer Science (R0)