A Linear Algebra Approach to Language Identification

Mather, Laura A.

doi:10.1007/3-540-49654-8_8

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1481)

Included in the following conference series:

International Workshop on Principles of Digital Document Processing

Abstract

Identification of the language of documents has traditionally been accomplished using dictionaries or other such language sources. This paper presents a novel algorithm for identifying the language of documents using much less information about the language than traditional methods. In addition, when nothing is known about the language of incoming documents, the algorithm still groups the documents by language. The algorithm is based on the vector space model of information retrieval and uses a matrix projection operator and the singular value decomposition to identify terms that distinguish between languages. Experimental results show that the algorithm works reasonably well.
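
The abstract names the ingredients (a vector space model, a projection operator, and the singular value decomposition) without giving the construction, so the following is only a minimal sketch of how such pieces could fit together, not the paper's algorithm. It assumes that the projection removes the corpus centroid direction, that a power iteration stands in for a full SVD routine, and that the six-sentence English/Spanish corpus is a toy stand-in for real documents. All function names are hypothetical.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, A) with A[i][j] = unit-normalised count of term i in doc j."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    columns = []
    for c in counts:
        norm = math.sqrt(sum(v * v for v in c.values()))
        columns.append([c[t] / norm for t in vocab])
    # Transpose so that rows are terms and columns are documents.
    A = [[col[i] for col in columns] for i in range(len(vocab))]
    return vocab, A


def project_out_centroid(A):
    """Apply P = I - c c^T to every column, where c is the unit centroid.

    Assumption: this is one possible 'projection operator'; the abstract
    does not say which projection the paper uses.
    """
    n_terms, n_docs = len(A), len(A[0])
    centroid = [sum(row) for row in A]
    norm = math.sqrt(sum(x * x for x in centroid))
    c = [x / norm for x in centroid]
    coeffs = [sum(c[i] * A[i][j] for i in range(n_terms)) for j in range(n_docs)]
    return [[A[i][j] - c[i] * coeffs[j] for j in range(n_docs)]
            for i in range(n_terms)]


def dominant_singular_pair(B, iterations=100):
    """Power iteration on B^T B; returns (term_weights, doc_weights)."""
    n_terms, n_docs = len(B), len(B[0])
    v = [float(j + 1) for j in range(n_docs)]  # deterministic, non-constant start
    for _ in range(iterations):
        u = [sum(B[i][j] * v[j] for j in range(n_docs)) for i in range(n_terms)]
        w = [sum(B[i][j] * u[i] for i in range(n_terms)) for j in range(n_docs)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(B[i][j] * v[j] for j in range(n_docs)) for i in range(n_terms)]
    return u, v


docs = [
    "the cat and the dog",
    "the dog and the bird",
    "the bird and the cat",
    "el gato y el perro",
    "el perro y el pajaro",
    "el pajaro y el gato",
]

vocab, A = term_document_matrix(docs)
term_weights, doc_weights = dominant_singular_pair(project_out_centroid(A))

# Documents are grouped by the sign of their entry in the singular vector.
labels = [0 if x < 0 else 1 for x in doc_weights]

# Terms with the largest absolute weight separate the two groups most strongly.
ranked = sorted(zip(vocab, term_weights), key=lambda p: abs(p[1]), reverse=True)
top_terms = [term for term, _ in ranked[:2]]

print(labels)
print(top_terms)
```

On this toy corpus the sign of each document's weight splits the English sentences from the Spanish ones, and the highest-weighted terms are the function words that occur in every document of one language. Real corpora share vocabulary across languages and contain more than two languages, so a practical version would need several singular vectors and a proper clustering step.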

References

Marc Damashek. Gauging similarity with n-grams: Language-independent categorization of text. Science, 267:843–848, 1995.
Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
Susan T. Dumais. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments & Computers, 23(2):229–236, 1991.
Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, 1989.
Gregory Grefenstette. Comparing two language identification schemes. In Proceedings of the 3rd International Conference on Statistical Analysis of Textual Data, 1995.
Chris Marron and Joe McCloskey. Optimal partitions and clustering. In Proceedings of the 1997 Conference on Linear Algebra and Applications. SIAM, 1997.

Author information

Authors and Affiliations

Computer Science Department, University of Colorado, USA
Laura A. Mather

Editor information

Editors and Affiliations

Department of Electrical Engineering and Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI, 53211, USA
Ethan V. Munson
Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, 1000 Hilltop Circle, Baltimore, MD, 21250, USA
Charles Nicholas
Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong SAR
Derick Wood

About this paper

Cite this paper

Mather, L.A. (1998). A Linear Algebra Approach to Language Identification. In: Munson, E.V., Nicholas, C., Wood, D. (eds) Principles of Digital Document Processing. PODDP 1998. Lecture Notes in Computer Science, vol 1481. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49654-8_8

DOI: https://doi.org/10.1007/3-540-49654-8_8
Published: 15 September 2000
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65086-7
Online ISBN: 978-3-540-49654-0
eBook Packages: Springer Book Archive
