A Comparison of Language Identification Approaches on Short, Query-Style Texts

Gottron, Thomas; Lipka, Nedim

doi:10.1007/978-3-642-12275-0_59

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5993)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

In a multi-language Information Retrieval setting, knowing the language of a user query is important for further processing. Hence, we compare the performance of some typical approaches for language detection on very short, query-style texts. The results show that an accuracy of more than 80% can already be achieved for single words; for slightly longer texts, we observed accuracy values close to 100%.
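
The abstract does not say which approaches were compared, so the following is only an illustrative sketch of one common family: a character trigram model scored with add-one-smoothed naive Bayes. The function names (`char_ngrams`, `train`, `identify`) and the two toy training sentences are assumptions made up for this example. They are not taken from the paper, and a real system would train on far more text per language.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Return the character n-grams of text, padded with spaces at both ends."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(samples, n=3):
    """Count n-grams per language; samples maps a language code to training text."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}

def identify(query, models, n=3):
    """Return the language whose model gives the query the highest log-likelihood."""
    # Shared vocabulary of all n-grams seen in any language model.
    vocab = set().union(*models.values())
    best_lang, best_score = None, -math.inf
    for lang, counts in models.items():
        total = sum(counts.values())
        # Add-one smoothing; the extra +1 reserves mass for unseen n-grams.
        denom = total + len(vocab) + 1
        score = sum(math.log((counts[g] + 1) / denom) for g in char_ngrams(query, n))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang

# Hypothetical toy training data (invented for illustration only).
TOY_SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the weather is nice today",
    "de": "der schnelle braune fuchs springt über den faulen hund und das wetter ist heute schön",
}
MODELS = train(TOY_SAMPLES)

if __name__ == "__main__":
    for word in ["weather", "wetter"]:
        print(word, identify(word, MODELS))
```

Padding the query with spaces lets word-boundary n-grams contribute evidence, which matters when the input is a single word with only a handful of n-grams in total.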

Author information

Authors and Affiliations

Institut für Informatik, Johannes Gutenberg-Universität Mainz, 55099, Mainz, Germany
Thomas Gottron
Faculty of Media, Media Systems, Bauhaus University Weimar, 99421, Weimar, Germany
Nedim Lipka

Editor information

Editors and Affiliations

Adaptive Information Cluster, Dublin City University, Dublin, 9, Ireland
Cathal Gurrin
The Open University, Walton Hall, MK7 6HF, Milton Keynes, UK
Yulan He
Microsoft Research Ltd, 7 JJ Thomson Avenue, CB3 0FB, Cambridge, UK
Gabriella Kazai
Department of Computer Science, University of Essex, Wivenhoe Park, CO4 3SQ, Colchester, UK
Udo Kruschwitz
The Open University, Walton Hall, Milton Keynes, UK
Suzanne Little
University of London, London, UK
Thomas Roelleke
Knowledge Media Institute, The Open University, MK7 6AA, Milton Keynes, UK
Stefan Rüger
Department of Computing Science, University of Glasgow, 17 Lilybank Gardens, G12 8QQ, Glasgow, UK
Keith van Rijsbergen

About this paper

Cite this paper

Gottron, T., Lipka, N. (2010). A Comparison of Language Identification Approaches on Short, Query-Style Texts. In: Gurrin, C., et al. Advances in Information Retrieval. ECIR 2010. Lecture Notes in Computer Science, vol 5993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12275-0_59

DOI: https://doi.org/10.1007/978-3-642-12275-0_59
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12274-3
Online ISBN: 978-3-642-12275-0
eBook Packages: Computer Science, Computer Science (R0)
