Abstract
In recent years, the volume of text documents on the internet, on intranets, in digital libraries and in newsgroups has grown unexpectedly fast. To obtain useful information and meaningful patterns from these documents, a great deal of research, known collectively as "text mining", has been carried out. One branch of this research is text categorization, which addresses the problem of classifying documents according to their similarities. One of the techniques applied in this area is the centroid-based document classification method. All research on text categorization uses the notion of frequency in one way or another. In this study, letter frequencies (LF) are used as the features for text categorization, and centroid-based document classification is carried out on these letter frequencies. An experiment on language detection for text documents has been performed. Its results suggest that letter-based text categorization should be done prior to term-based text categorization.
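The approach summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 26-letter ASCII alphabet, the use of cosine similarity, and all function names (letter_frequencies, centroid, cosine, train, classify) are assumptions made for illustration, since the abstract does not specify the alphabet or the similarity measure. Each document is reduced to a vector of relative letter frequencies, each language is represented by the mean (centroid) of its training vectors, and a new document is assigned to the language whose centroid is most similar.

```python
from collections import Counter
import math
import string

# Assumption: only the 26 ASCII letters are counted. A real system would
# likely need language-specific letters (e.g. accented characters) as well.
ALPHABET = string.ascii_lowercase


def letter_frequencies(text):
    """Return the relative frequency of each letter in ALPHABET."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(ALPHABET)
    return [counts[ch] / total for ch in ALPHABET]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; assumed here, not stated in the abstract."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(labelled_docs):
    """Build one centroid per label from {label: [document, ...]}."""
    return {
        label: centroid([letter_frequencies(doc) for doc in docs])
        for label, docs in labelled_docs.items()
    }


def classify(text, centroids):
    """Return the label whose centroid is most similar to the text."""
    vector = letter_frequencies(text)
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


if __name__ == "__main__":
    # Toy training data, far too small for a meaningful evaluation.
    corpus = {
        "english": ["the quick brown fox jumps over the lazy dog",
                    "this is a short sentence written in english"],
        "german": ["der schnelle braune fuchs springt über den faulen hund",
                   "dies ist ein kurzer satz in deutscher sprache"],
    }
    model = train(corpus)
    print(classify("where is the nearest train station", model))
```

Because the feature vector has one fixed-length entry per letter, training and classification are cheap compared with term-based features, which is consistent with the abstract's suggestion to run letter-based categorization first.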
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Takcı, H., Soğukpınar, İ. (2004). Centroid-Based Language Identification Using Letter Feature Set. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2004. Lecture Notes in Computer Science, vol 2945. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24630-5_79
DOI: https://doi.org/10.1007/978-3-540-24630-5_79
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21006-1
Online ISBN: 978-3-540-24630-5
eBook Packages: Springer Book Archive