Centroid-Based Language Identification Using Letter Feature Set

Takcı, Hidayet; Soğukpınar, İbrahim

doi:10.1007/978-3-540-24630-5_79

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2945)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

In recent years, the volume of text documents on the internet, on intranets, in digital libraries, and in newsgroups has grown unexpectedly. To obtain useful information and meaningful patterns from these documents, a great deal of research, known under the term “text mining”, has been carried out. Among these efforts is text categorization, which covers the problem of classifying documents according to their similarities. One of the techniques applied in this area is the centroid-based document classification method. All research on text categorization uses the notion of frequency in one way or another. In this study, letter frequencies (LF) are used for text categorization: centroid-based document classification is carried out on letter frequency information. An experiment on language detection for text documents has been performed. Its results suggest that letter-based text categorization should be performed before term-based text categorization.
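
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The alphabet in `ALPHABET`, the use of cosine similarity (common in centroid-based classification), and all function names are assumptions made for illustration, since the abstract specifies neither the exact feature set nor the similarity measure.

```python
import math
from collections import Counter

# Assumed feature set: Latin letters plus Turkish-specific letters.
# The paper's actual letter feature set is not given in the abstract.
ALPHABET = "abcdefghijklmnopqrstuvwxyzçğıöşü"


def letter_frequencies(text):
    """Return the relative frequency of each ALPHABET letter in text."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(ALPHABET)
    return [counts[ch] / total for ch in ALPHABET]


def centroid(vectors):
    """Component-wise mean of equally long vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def train(samples):
    """Build one centroid per language from {language: [texts]}."""
    return {
        lang: centroid([letter_frequencies(t) for t in texts])
        for lang, texts in samples.items()
    }


def identify(text, centroids):
    """Return the language whose centroid is most similar to text."""
    vec = letter_frequencies(text)
    return max(centroids, key=lambda lang: cosine(vec, centroids[lang]))


# Hypothetical toy training data; a real experiment would need far more text.
samples = {
    "en": ["the quick brown fox jumps over the lazy dog",
           "language identification with letter frequencies"],
    "tr": ["çocuklar bahçede oyun oynuyor ve gülüyor",
           "bugün hava çok güzel, dışarı çıkalım"],
}
centroids = train(samples)
print(identify("şu köşedeki küçük dükkân", centroids))
```

Because each document is reduced to a short fixed-length vector, classification needs only one similarity computation per language. That low cost is one plausible reason a letter-based pass could precede a heavier term-based one.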

Author information

Authors and Affiliations

Gebze Institute of Technology, 41400, Gebze/Kocaeli, Turkey
Hidayet Takcı & İbrahim Soğukpınar

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, México
Alexander Gelbukh

About this paper

Cite this paper

Takcı, H., Soğukpınar, İ. (2004). Centroid-Based Language Identification Using Letter Feature Set. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2004. Lecture Notes in Computer Science, vol 2945. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24630-5_79

DOI: https://doi.org/10.1007/978-3-540-24630-5_79
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21006-1
Online ISBN: 978-3-540-24630-5
eBook Packages: Springer Book Archive