Abstract
K-NN is widely used for text classification. Basic K-NN has poor accuracy, so other methods should be applied to it to improve accuracy and efficiency. In this paper we propose a method that uses WordNet to increase the similarity of documents in the same category. Documents are represented by single words and their frequencies; using WordNet, the frequencies of related words are adjusted to achieve higher accuracy. Information gain is used to eliminate terms that are not discriminative. Words such as "and", "or" and "that" in English are not important in text classification, and the best way to eliminate them is to calculate their information gain. PCA is used to reduce the number of features and increase the speed of the method. Applying this method, we designed a faster and more accurate classifier for the Persian language. Experiments show that applying this preprocessing increases the accuracy and speed of K-NN. The accuracy of the proposed K-NN classifier on the Hamshahri corpus is 88.18%.
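The information gain step can be illustrated with a small sketch. The abstract does not give the exact formulation, so the code below assumes the common definition for a binary feature ("term occurs in the document"): the class entropy minus the class entropy conditioned on the presence or absence of the term. The toy corpus, the English words, the function names and the 0.5 threshold are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in document'."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    # Class entropy that remains after splitting on presence of the term.
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional

# Hypothetical toy corpus: each document is the set of words it contains.
docs = [
    {"and", "football", "goal"},
    {"and", "football", "match"},
    {"and", "election", "vote"},
    {"and", "election", "party"},
]
labels = ["sport", "sport", "politics", "politics"]

vocab = sorted(set().union(*docs))
scores = {t: information_gain(docs, labels, t) for t in vocab}

# Keep only terms whose gain exceeds an assumed threshold.
THRESHOLD = 0.5
selected = [t for t in vocab if scores[t] > THRESHOLD]

for term in vocab:
    print(f"{term:10s} {scores[term]:.3f}")
print("selected:", selected)
```

In this toy setting, "and" appears in every document, so knowing that it occurs says nothing about the category and its gain is zero, whereas "football" and "election" each split the two categories perfectly. This matches the abstract's point that function words are filtered out by their low information gain, without needing a hand-built stop-word list.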
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Parchami, M., Akhtar, B., Dezfoulian, M. (2012). Persian Text Classification Based on K-NN Using Wordnet. In: Jiang, H., Ding, W., Ali, M., Wu, X. (eds) Advanced Research in Applied Artificial Intelligence. IEA/AIE 2012. Lecture Notes in Computer Science, vol 7345. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31087-4_30
DOI: https://doi.org/10.1007/978-3-642-31087-4_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31086-7
Online ISBN: 978-3-642-31087-4
eBook Packages: Computer Science, Computer Science (R0)