Abstract
Identifying the language of an e-text is complicated by the existence of several character sets for a single language. We present a language identification system that uses Multivariate Analysis (MVA) for dimensionality reduction and classification. We compare its performance with that of existing schemes, namely N-gram-based and compression-based methods.
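To make the N-gram baseline mentioned in the abstract concrete, the following is a minimal sketch of a rank-order ("out-of-place") character n-gram identifier. It is not the authors' MVA system, and the abstract does not specify how either method was configured. The function names, the `n_max` and `top_k` parameters, and the two tiny training sentences are all assumptions made for illustration.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    """Return the text's character n-grams (lengths 1..n_max), most frequent first."""
    text = " ".join(text.lower().split())  # lowercase and collapse whitespace
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Break frequency ties alphabetically so the ranking is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top_k]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile cost a fixed penalty."""
    rank = {g: r for r, g in enumerate(lang_profile)}
    penalty = len(lang_profile)  # assumed cost for an unseen n-gram
    return sum(
        abs(r - rank[g]) if g in rank else penalty
        for r, g in enumerate(doc_profile)
    )

def identify(text, profiles):
    """Return the label whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

# Hypothetical toy training data; a real system would use far larger corpora.
english = "the quick brown fox jumps over the lazy dog and then the dog sleeps"
spanish = "el rápido zorro marrón salta sobre el perro perezoso y luego el perro duerme"
profiles = {"en": ngram_profile(english), "es": ngram_profile(spanish)}

print(identify("the dog sleeps over there", profiles))
```

Because the profiles work on decoded characters, text stored in different character sets would need to be decoded to a common representation first, which is the complication the abstract points to. With training samples this small the prediction for unseen text is not reliable; the sketch only shows the mechanics of the comparison.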
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Vinosh Babu, J., Baskaran, S. (2005). Automatic Language Identification Using Multivariate Analysis. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_89
DOI: https://doi.org/10.1007/978-3-540-30586-6_89
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24523-0
Online ISBN: 978-3-540-30586-6
eBook Packages: Computer Science, Computer Science (R0)