Disentangling from Babylonian Confusion – Unsupervised Language Identification

Biemann, Chris; Teresniak, Sven

doi:10.1007/978-3-540-30586-6_87

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3406)

Abstract

This work presents an unsupervised solution to language identification. The method sorts the sentences of a multilingual text corpus by language and makes no assumptions about the number or size of the monolingual fractions. Evaluation on 7-lingual and bilingual corpora shows that classification quality is comparable to that of supervised approaches and that the method works almost error-free once at least 100 sentences per language are available.
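
The abstract does not describe the algorithm itself, so the following is only a toy illustration of the task, not the authors' method. It is a minimal sketch that assumes sentences of the same language tend to share word forms: sentences that share a token are linked, and each connected group is one cluster. The function name `cluster_sentences` and the sample sentences are hypothetical. The number of languages is never given to the procedure, which mirrors the "no assumptions on the number of fractions" property stated above.

```python
from collections import defaultdict


def cluster_sentences(sentences):
    """Group sentences into clusters linked by shared lowercase word tokens.

    Assumption (not from the paper): two sentences sharing any token
    belong to the same language. Real corpora violate this for words
    that exist in several languages, so this is a sketch only.
    """
    parent = list(range(len(sentences)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_seen = {}  # token -> index of the first sentence containing it
    for idx, sentence in enumerate(sentences):
        for word in sentence.lower().split():
            if word in first_seen:
                union(first_seen[word], idx)
            else:
                first_seen[word] = idx

    clusters = defaultdict(list)
    for idx, sentence in enumerate(sentences):
        clusters[find(idx)].append(sentence)
    return list(clusters.values())


if __name__ == "__main__":
    sample = [
        "the cat sat on the mat",
        "der hund schläft im haus",
        "the dog is in the house",
        "das haus ist sehr alt",
        "a cat is not a dog",
        "der hund ist sehr müde",
    ]
    for group in cluster_sentences(sample):
        print(group)
```

A single token shared across languages would merge two clusters in this sketch, so a practical approach would need some weighting or filtering of the links.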

Author information

Authors and Affiliations

Computer Science Institute, NLP Dept., Leipzig University, Augustusplatz 10/11, 04109, Leipzig, Germany
Chris Biemann & Sven Teresniak

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, México
Alexander Gelbukh

About this paper

Cite this paper

Biemann, C., Teresniak, S. (2005). Disentangling from Babylonian Confusion – Unsupervised Language Identification. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_87

DOI: https://doi.org/10.1007/978-3-540-30586-6_87
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24523-0
Online ISBN: 978-3-540-30586-6
eBook Packages: Computer Science, Computer Science (R0)