Abstract
This work presents an unsupervised solution to language identification. The method sorts multilingual text corpora, sentence by sentence, into the languages they contain and makes no assumptions about the number or size of the monolingual fractions. Evaluation on 7-lingual and bilingual corpora shows that classification quality is comparable to that of supervised approaches and that the method works almost error-free from 100 sentences per language onward.
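The abstract does not describe the algorithm itself. As a purely illustrative sketch, and not the authors' method, one naive way to separate sentences by language without training data is to link words that share a sentence and treat each connected component of the resulting graph as one language. The names `cluster_sentences` and `min_cooccurrence` are hypothetical, and the approach assumes that the languages share almost no vocabulary.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cluster_sentences(sentences, min_cooccurrence=1):
    """Group sentences by connected components of a word co-occurrence graph.

    Illustrative assumption: words of different languages rarely co-occur.
    """
    tokenized = [s.lower().split() for s in sentences]

    # Count how often each unordered word pair shares a sentence.
    pair_counts = Counter()
    for tokens in tokenized:
        for a, b in combinations(sorted(set(tokens)), 2):
            pair_counts[(a, b)] += 1

    # Union-find over words; link pairs that co-occur often enough.
    parent = {}

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    for (a, b), count in pair_counts.items():
        if count >= min_cooccurrence:
            parent[find(a)] = find(b)

    # Assign each sentence to the component that most of its words belong to.
    groups = defaultdict(list)
    for sentence, tokens in zip(sentences, tokenized):
        if not tokens:
            continue  # skip empty sentences
        votes = Counter(find(word) for word in tokens)
        groups[votes.most_common(1)[0][0]].append(sentence)
    return list(groups.values())


if __name__ == "__main__":
    corpus = [
        "the cat sleeps",
        "der Hund schläft",
        "the dog sleeps",
        "der Kater schläft",
    ]
    for group in cluster_sentences(corpus):
        print(group)
```

On real corpora, shared tokens such as names and numbers would merge components, so raising `min_cooccurrence` or replacing connected components with a proper graph clustering step would likely be needed.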
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Biemann, C., Teresniak, S. (2005). Disentangling from Babylonian Confusion – Unsupervised Language Identification. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_87
DOI: https://doi.org/10.1007/978-3-540-30586-6_87
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24523-0
Online ISBN: 978-3-540-30586-6
eBook Packages: Computer Science, Computer Science (R0)