Disentangling from Babylonian Confusion – Unsupervised Language Identification

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3406)

Abstract

This work presents an unsupervised solution to language identification. The method sorts the sentences of a multilingual text corpus into the languages the corpus contains and makes no assumptions about the number or size of the monolingual fractions. Evaluations on 7-lingual and bilingual corpora show that classification quality is comparable to that of supervised approaches and that the method works almost error-free from 100 sentences per language upward.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Biemann, C., Teresniak, S. (2005). Disentangling from Babylonian Confusion – Unsupervised Language Identification. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_87

  • DOI: https://doi.org/10.1007/978-3-540-30586-6_87

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24523-0

  • Online ISBN: 978-3-540-30586-6

  • eBook Packages: Computer Science, Computer Science (R0)
