Abstract
Document categorization is the task of assigning a category to a given document. Supervised methods mostly rely on training data and on rich linguistic resources that are either language-specific or generic. This study proposes a knowledge-poor approach to text categorization that uses no rule sets and no language-specific resources such as a part-of-speech tagger or shallow parser. Knowledge-poor here refers to the lack of a reasonable amount of background knowledge. The proposed system architecture takes the data as-is and simply separates tokens by spaces. Documents represented in the vector space model are used as training data for many machine learning algorithms. We empirically examined and compared several factors, from similarity metrics to learning algorithms, in a variety of experimental setups. Although researchers believe that particular classifiers or metrics are better than others for text categorization, recent studies show that the ranking of the models depends on the class, the experimental setup, and the domain. The study features extensive evaluation and comparison across a variety of experiments. We evaluate models and similarity metrics for Turkish, an agglutinative language, within the knowledge-poor framework. The results are expected to be beneficial for other studies.
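The pipeline outlined in the abstract (space-separated tokens, a vector space representation, a similarity metric) can be illustrated with a minimal sketch. This is not the chapter's implementation: the function names, the raw term-frequency weighting, the choice of cosine similarity, the nearest-example decision rule, and the toy Turkish sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Knowledge-poor tokenization: split on whitespace only.
    # No stemming, lowercasing, or stop-word removal is applied.
    return text.split()


def tf_vector(text):
    # Sparse term-frequency vector: token -> raw count (assumed weighting).
    return Counter(tokenize(text))


def cosine_similarity(a, b):
    # Cosine of the angle between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_category(document, labeled_examples):
    # Assign the label of the most similar training example
    # (a 1-nearest-neighbour rule, chosen here only for brevity).
    query = tf_vector(document)
    best_label, best_score = None, -1.0
    for text, label in labeled_examples:
        score = cosine_similarity(query, tf_vector(text))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data, not taken from the chapter.
training = [
    ("takım maçı kazandı", "spor"),
    ("borsa bugün yükseldi", "ekonomi"),
]
predicted = nearest_category("takım bugün maçı kaybetti", training)
```

Because tokens are compared as exact surface forms, inflected variants of the same Turkish stem count as distinct features in this sketch, which is the trade-off a knowledge-poor setup accepts.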
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yildirim, S. (2014). A Knowledge-Poor Approach to Turkish Text Categorization. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8404. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54903-8_36
DOI: https://doi.org/10.1007/978-3-642-54903-8_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54902-1
Online ISBN: 978-3-642-54903-8
eBook Packages: Computer Science (R0)