The Effect of Stemming and Stop-Word-Removal on Automatic Text Classification in Turkish Language

  • Conference paper
Neural Information Processing (ICONIP 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9489)

Abstract

Text classification is the labeling of unstructured natural-language documents with predefined categories or classes. Classification not only helps organizations improve their business communication and customer satisfaction levels, but also improves the use of unstructured data in academic and non-academic settings. The aim of this study is to analyze the effect of stemming, over-sampling, and stop-word removal on the automatic classification of Turkish content. After a Turkish corpus is obtained, stemming, balancing, and stop-word removal are applied and the results are evaluated.
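The three preprocessing steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the suffix list, and the function names `stem`, `preprocess`, and `oversample` are hypothetical, the stemmer is a naive suffix stripper rather than a morphological analyzer, and plain random duplication stands in for whatever balancing method the study used. Input text is assumed to be lowercased already.

```python
import random
from collections import Counter

# Hypothetical, tiny stop-word list; a real study would use a curated Turkish list.
STOPWORDS = {"ve", "bir", "bu", "ile", "de", "da"}

# Hypothetical suffix list for a naive stemmer (longest suffixes first).
SUFFIXES = ("lar", "ler", "da", "de")


def stem(token, min_stem_len=2):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem_len:
            return token[: -len(suffix)]
    return token


def preprocess(text, remove_stopwords=True, use_stemming=True):
    """Whitespace-tokenize, then optionally drop stop words and stem."""
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if use_stemming:
        tokens = [stem(t) for t in tokens]
    return tokens


def oversample(docs, labels, seed=0):
    """Duplicate random minority-class documents until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_docs, out_labels = list(docs), list(labels)
    for label, count in counts.items():
        pool = [d for d, l in zip(docs, labels) if l == label]
        for _ in range(target - count):
            out_docs.append(rng.choice(pool))
            out_labels.append(label)
    return out_docs, out_labels


if __name__ == "__main__":
    print(preprocess("kitaplar ve evler okulda"))  # ['kitap', 'ev', 'okul']
```

Toggling `remove_stopwords` and `use_stemming` independently, and running the classifier with and without `oversample`, mirrors the kind of comparison the abstract describes: each preprocessing step is switched on or off and the classification results are compared.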

Correspondence to Mustafa Çağataylı.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Çağataylı, M., Çelebi, E. (2015). The Effect of Stemming and Stop-Word-Removal on Automatic Text Classification in Turkish Language. In: Arik, S., Huang, T., Lai, W., Liu, Q. (eds) Neural Information Processing. ICONIP 2015. Lecture Notes in Computer Science, vol 9489. Springer, Cham. https://doi.org/10.1007/978-3-319-26532-2_19

  • DOI: https://doi.org/10.1007/978-3-319-26532-2_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-26531-5

  • Online ISBN: 978-3-319-26532-2

  • eBook Packages: Computer Science, Computer Science (R0)