Abstract
Identification of the language of documents has traditionally relied on dictionaries or other such language resources. This paper presents a novel algorithm that identifies the language of documents using much less information about the language than traditional methods. In addition, if nothing is known about the language of incoming documents, the algorithm still groups the documents by language. The algorithm is based on the vector space model of information retrieval and uses a matrix projection operator and the singular value decomposition to identify terms that distinguish between languages. Experimental results show that the algorithm works reasonably well.
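The abstract does not spell out the construction, so the following is only a rough sketch of the general idea, not the chapter's actual algorithm. It assumes word-level terms, unit-length document vectors, and a centering projector (the identity minus the averaging matrix) as the projection operator; all of these choices, and the function names `term_document_matrix` and `split_by_language`, are hypothetical. The singular value decomposition of the projected term-document matrix then gives a leading right singular vector whose signs split the documents into two groups, and a leading left singular vector whose largest entries point to the terms that separate those groups.

```python
import numpy as np

def term_document_matrix(docs):
    """Return (vocab, A), where A[i, j] is the weight of term i in document j."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            A[index[w], j] += 1.0
    # Scale every document (column) to unit Euclidean length.
    A /= np.linalg.norm(A, axis=0, keepdims=True)
    return vocab, A

def split_by_language(docs, n_terms=5):
    """Split docs into two groups and list the most discriminating terms."""
    vocab, A = term_document_matrix(docs)
    n = A.shape[1]
    # Assumed projection operator: removes the mean document from each column.
    P = np.eye(n) - np.ones((n, n)) / n
    U, s, Vt = np.linalg.svd(A @ P, full_matrices=False)
    # The sign of the leading right singular vector assigns each document
    # to one of two groups (which group gets label 1 is arbitrary).
    groups = (Vt[0] > 0).astype(int)
    # Terms with the largest absolute weight in the leading left singular
    # vector contribute most to the separation.
    order = np.argsort(-np.abs(U[:, 0]))
    top_terms = [vocab[i] for i in order[:n_terms]]
    return groups, top_terms

# Toy corpus (illustrative only): two English and two French sentences.
docs = [
    "the cat is on the table and the dog is in the garden",
    "the house is near the river and the bridge",
    "le chat est sur la table et le chien est dans le jardin",
    "la maison est sur la colline et le pont",
]
groups, top_terms = split_by_language(docs)
print(groups, top_terms)
```

On a corpus this small, frequent function words such as "the" and "le" dominate the leading singular vector, which is consistent with the abstract's claim that a few distinguishing terms can stand in for a full dictionary. Handling more than two languages would need further singular vectors or a clustering step, which this sketch does not attempt.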
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mather, L.A. (1998). A Linear Algebra Approach to Language Identification. In: Munson, E.V., Nicholas, C., Wood, D. (eds) Principles of Digital Document Processing. PODDP 1998. Lecture Notes in Computer Science, vol 1481. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49654-8_8
DOI: https://doi.org/10.1007/3-540-49654-8_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65086-7
Online ISBN: 978-3-540-49654-0