Persian Text Classification Based on K-NN Using Wordnet

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7345)

Abstract

K-NN is widely used for text classification. Basic K-NN has poor accuracy, so other methods must be applied to it to improve accuracy and efficiency. In this paper we propose a method that uses a wordnet to increase the similarity of documents in the same category. Documents are represented by single words and their frequencies; using the wordnet, the frequencies of related words are adjusted to achieve higher accuracy. Information gain is used to eliminate terms that are not discriminative. Words such as "and", "or" and "that" in English are not important in text classification, and the best way to eliminate them is to calculate their information gain. PCA is used to reduce the number of features and increase the speed of the method. Applying this method, we designed a faster and more accurate classifier for the Persian language. Experiments show that this preprocessing increases the accuracy and speed of K-NN. The accuracy of the proposed K-NN classifier on the Hamshahri corpus is 88.18%.
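The preprocessing steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how related-word frequencies are changed, so the `weight` parameter, the toy `RELATED` dictionary standing in for a wordnet lookup, and the use of cosine similarity in the K-NN step are all assumptions made for illustration. The PCA step is omitted, and English toy words are used in place of Persian text.

```python
import math
from collections import Counter

# Hypothetical stand-in for a wordnet lookup: word -> set of related words.
RELATED = {"car": {"automobile"}, "automobile": {"car"}}

def boost_related(counts, related, weight=0.5):
    """Add a fraction of each word's frequency to its related words.
    The additive scheme and the weight are assumptions, not the paper's rule."""
    boosted = Counter(counts)
    for word, freq in counts.items():
        for other in related.get(word, ()):
            boosted[other] += weight * freq
    return boosted

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(term, docs, labels):
    """Information gain of the binary feature 'term occurs in the document'."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    gain = entropy(labels)
    for subset in (present, absent):
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def knn_predict(query, docs, labels, k=3):
    """Majority label among the k training documents most similar to the query."""
    ranked = sorted(zip(docs, labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    return Counter(y for _, y in ranked[:k]).most_common(1)[0][0]

# Toy training data (invented for this sketch).
train_docs = [Counter(car=2, engine=1, the=1), Counter(car=1, road=1, the=1),
              Counter(bank=2, loan=1, the=1), Counter(loan=2, the=1)]
train_labels = ["auto", "auto", "economy", "economy"]

# Keep only terms whose information gain is positive; "the" occurs in every
# document, so it carries no class information and is dropped.
vocabulary = {t for d in train_docs for t in d}
kept = {t for t in vocabulary
        if information_gain(t, train_docs, train_labels) > 1e-12}

def preprocess(doc):
    """Filter by information gain, then boost related-word frequencies."""
    filtered = Counter({t: f for t, f in doc.items() if t in kept})
    return boost_related(filtered, RELATED)

train_vectors = [preprocess(d) for d in train_docs]
# "automobile" never appears in training, so it is filtered out here; only
# the shared term "engine" links this query to the first "auto" document.
query = preprocess(Counter(automobile=1, engine=1, the=1))
prediction = knn_predict(query, train_vectors, train_labels, k=3)
```

In this sketch a term that appears in every document has an information gain of zero, which is why common function words are removed without needing a hand-written stop list. Note that filtering before boosting discards query words unseen in training; applying the wordnet adjustment first would be an equally plausible ordering, and the abstract does not say which the authors use.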

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Parchami, M., Akhtar, B., Dezfoulian, M. (2012). Persian Text Classification Based on K-NN Using Wordnet. In: Jiang, H., Ding, W., Ali, M., Wu, X. (eds) Advanced Research in Applied Artificial Intelligence. IEA/AIE 2012. Lecture Notes in Computer Science, vol 7345. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31087-4_30

  • DOI: https://doi.org/10.1007/978-3-642-31087-4_30

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-31086-7

  • Online ISBN: 978-3-642-31087-4

  • eBook Packages: Computer Science, Computer Science (R0)
