The Effect of Stemming and Stop-Word-Removal on Automatic Text Classification in Turkish Language

Çağataylı, Mustafa; Çelebi, Erbuğ

doi:10.1007/978-3-319-26532-2_19

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9489)

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

Text classification is the labeling of unstructured natural-language text documents with predefined categories or classes. Such classification not only helps organizations improve their business communication and customer satisfaction levels, but also improves the use of unstructured data in the academic and non-academic worlds. The aim of this study is to analyze the effect of stemming, over-sampling, and stop-word removal on the automatic classification of Turkish content. After obtaining a Turkish corpus, stemming, balancing, and stop-word removal are applied and the results are evaluated.
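The three preprocessing steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word set, the suffix list, and the function names `stem`, `preprocess`, and `oversample` are hypothetical, the stemmer is a naive suffix stripper rather than a morphological analyzer, and balancing is done by random duplication of minority-class documents, which may differ from the over-sampling technique used in the study.

```python
import random
from collections import Counter

# Hypothetical toy resources, chosen only for illustration.
# A real study would use a full Turkish stop-word list and a proper stemmer.
STOP_WORDS = {"ve", "bir", "bu", "ile", "için", "de", "da"}
SUFFIXES = ("lar", "ler", "da", "de", "ın", "in")  # longer suffixes first


def stem(token):
    """Strip the first matching suffix, keeping a stem of at least two characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 2:
            return token[: -len(suffix)]
    return token


def preprocess(text, remove_stop_words=True, apply_stemming=True):
    """Tokenize on whitespace, then optionally drop stop words and stem.

    Note: str.lower() does not apply Turkish casing rules (I -> ı),
    so this sketch assumes input that is already lowercase.
    """
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if apply_stemming:
        tokens = [stem(t) for t in tokens]
    return tokens


def oversample(docs, labels, seed=0):
    """Randomly duplicate minority-class documents until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_docs, out_labels = list(docs), list(labels)
    for label, count in counts.items():
        pool = [d for d, l in zip(docs, labels) if l == label]
        for _ in range(target - count):
            out_docs.append(rng.choice(pool))
            out_labels.append(label)
    return out_docs, out_labels


print(preprocess("kitaplar ve evler"))  # → ['kitap', 'ev']
```

Toggling `remove_stop_words` and `apply_stemming` independently, with and without `oversample`, gives the kind of configuration grid over which a classifier's results could be compared.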

Author information

Authors and Affiliations

Cyprus International University, North Nicosia, North Cyprus
Mustafa Çağataylı & Erbuğ Çelebi

Corresponding author

Correspondence to Mustafa Çağataylı.

Editor information

Editors and Affiliations

University of Istanbul, Istanbul, Turkey
Sabri Arik
University at Qatar, Doha, Qatar
Tingwen Huang
Tunku Abdul Rahman University College, Kuala Lumpur, Malaysia
Weng Kin Lai
University of Science Technology, Wuhan, China
Qingshan Liu

Cite this paper

Çağataylı, M., Çelebi, E. (2015). The Effect of Stemming and Stop-Word-Removal on Automatic Text Classification in Turkish Language. In: Arik, S., Huang, T., Lai, W., Liu, Q. (eds) Neural Information Processing. ICONIP 2015. Lecture Notes in Computer Science, vol 9489. Springer, Cham. https://doi.org/10.1007/978-3-319-26532-2_19

DOI: https://doi.org/10.1007/978-3-319-26532-2_19
Published: 12 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26531-5
Online ISBN: 978-3-319-26532-2
eBook Packages: Computer Science, Computer Science (R0)
