A Comparison of Language Identification Approaches on Short, Query-Style Texts

  • Conference paper
Advances in Information Retrieval (ECIR 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5993)

Abstract

In a multi-language Information Retrieval setting, knowing the language of a user query is important for further processing. Hence, we compare the performance of some typical approaches to language detection on very short, query-style texts. The results show that an accuracy of more than 80% can be achieved even for single words; for slightly longer texts, we observed accuracy values close to 100%.

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gottron, T., Lipka, N. (2010). A Comparison of Language Identification Approaches on Short, Query-Style Texts. In: Gurrin, C., et al. Advances in Information Retrieval. ECIR 2010. Lecture Notes in Computer Science, vol 5993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12275-0_59

  • DOI: https://doi.org/10.1007/978-3-642-12275-0_59

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12274-3

  • Online ISBN: 978-3-642-12275-0

  • eBook Packages: Computer Science, Computer Science (R0)
