A Knowledge-Poor Approach to Turkish Text Categorization

Yildirim, Savaş

doi:10.1007/978-3-642-54903-8_36


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8404)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics


Abstract

Document categorization is the task of determining a category for a given document. Supervised methods mostly rely on training data and rich linguistic resources that are either language-specific or generic. This study proposes a knowledge-poor approach to text categorization that uses no rule sets or language-specific resources such as a part-of-speech tagger or shallow parser. Knowledge-poor here refers to the lack of a reasonable amount of background knowledge. The proposed system architecture takes the data as-is and simply separates tokens by spaces. Documents represented in vector space models are used as training data for many machine learning algorithms. We empirically examined and compared several factors, from similarity metrics to learning algorithms, in a variety of experimental setups. Although researchers believe that some particular classifiers or metrics are better than others for text categorization, recent studies show that the ranking of the models depends on the class, the experimental setup, and the domain. The study features extensive evaluation and comparison across a variety of experiments. We evaluate models and similarity metrics for Turkish, an agglutinative language, specifically within a knowledge-poor framework. The results of the study are expected to be beneficial for other studies.
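
The pipeline outlined in the abstract (split on spaces, represent documents as vectors, compare them with a similarity metric) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the author's implementation: the function names, the raw term-frequency weighting, the cosine metric, the nearest-neighbour decision rule, and the tiny Turkish sample set are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Knowledge-poor tokenization: split on whitespace only,
    # with no stemming, tagging, or morphological analysis.
    return text.split()


def to_vector(text):
    # Sparse term-frequency vector (assumed weighting; the paper may use others).
    return Counter(tokenize(text))


def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_category(doc, labelled):
    # Assign the label of the most similar training document (1-nearest neighbour).
    vec = to_vector(doc)
    best = max(labelled, key=lambda pair: cosine(vec, to_vector(pair[0])))
    return best[1]


# Hypothetical toy training data, invented for illustration only.
TRAIN = [
    ("takım maçı kazandı", "spor"),
    ("borsa bugün yükseldi", "ekonomi"),
]

predicted = nearest_category("takım bugün maçı kaybetti", TRAIN)
```

Because tokens are compared as surface strings, inflected forms of the same Turkish word count as different features, which is the trade-off a knowledge-poor setup accepts in exchange for needing no linguistic resources.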



Author information

Authors and Affiliations

Department of Computer Engineering, Faculty of Engineering, Istanbul Bilgi University, Santral Istanbul Campus, Eyüp, Istanbul, Turkey
Savaş Yildirim


Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, Av. Juan Dios Bátiz, Col. Nueva Industrial Vallejo, 07738, Mexico D.F, Mexico
Alexander Gelbukh


About this paper

Cite this paper

Yildirim, S. (2014). A Knowledge-Poor Approach to Turkish Text Categorization. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8404. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54903-8_36


DOI: https://doi.org/10.1007/978-3-642-54903-8_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54902-1
Online ISBN: 978-3-642-54903-8
eBook Packages: Computer Science, Computer Science (R0)
