
A Wikipedia-Based Multilingual Retrieval Model

  • Conference paper
Advances in Information Retrieval (ECIR 2008)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4956)

Abstract

This paper introduces CL-ESA, a new multilingual retrieval model for the analysis of cross-language similarity. The retrieval model exploits the multilingual alignment of Wikipedia: given a document d written in language L, we construct a concept vector \(\mathbf{d}\) for d, where each dimension i of \(\mathbf{d}\) quantifies the similarity of d with respect to a document \(d^*_i\) chosen from the “L-subset” of Wikipedia. Likewise, for a second document d′ written in language L′, \(L \neq L'\), we construct a concept vector \(\mathbf{d}'\), using the topic-aligned counterparts \(d'^*_i\) of our previously chosen documents, taken from the L′-subset of Wikipedia.

Since the two concept vectors \(\mathbf{d}\) and \(\mathbf{d}'\) are collection-relative representations of d and d′, they are language-independent; i.e., their similarity can be computed directly, for instance with the cosine similarity measure.
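The construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the similarity between a document and an index document is measured, so a bag-of-words term-frequency vector with cosine similarity is assumed here. The tiny English and German index collections (`INDEX_EN`, `INDEX_DE`) and all function names are hypothetical stand-ins for the topic-aligned Wikipedia subsets.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (an assumed monolingual representation)."""
    return Counter(text.lower().split())


def cosine_sparse(u, v):
    """Cosine similarity of two sparse vectors stored as term -> weight mappings."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_dense(a, b):
    """Cosine similarity of two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def concept_vector(document, index_documents):
    """Dimension i holds the similarity of `document` to index document i."""
    doc_vec = tf_vector(document)
    return [cosine_sparse(doc_vec, tf_vector(idx)) for idx in index_documents]


def cl_esa_similarity(d, d_prime, index_l, index_l_prime):
    """Compare two documents in different languages via their concept vectors.

    index_l[i] and index_l_prime[i] must describe the same topic.
    """
    if len(index_l) != len(index_l_prime):
        raise ValueError("index collections must be aligned one-to-one")
    return cosine_dense(concept_vector(d, index_l),
                        concept_vector(d_prime, index_l_prime))


# Hypothetical topic-aligned index documents (toy stand-ins for Wikipedia articles).
INDEX_EN = [
    "the cat is a small domestic animal",
    "a car is a motor vehicle with wheels",
    "music is sound organized in time",
]
INDEX_DE = [
    "die katze ist ein kleines haustier",
    "ein auto ist ein fahrzeug mit motor und reifen",
    "musik ist klang geordnet in der zeit",
]

D_EN = "my cat is a playful animal"
D_DE_CAT = "meine katze ist ein haustier"
D_DE_CAR = "das auto hat einen motor und vier reifen"

sim_cat = cl_esa_similarity(D_EN, D_DE_CAT, INDEX_EN, INDEX_DE)
sim_car = cl_esa_similarity(D_EN, D_DE_CAR, INDEX_EN, INDEX_DE)
print(f"EN cat vs DE cat: {sim_cat:.3f}")
print(f"EN cat vs DE car: {sim_car:.3f}")
```

In this toy setting the English and German documents share no vocabulary, yet the two documents about cats receive a higher score than the cat and car pair, because both are compared only through their similarities to the aligned index documents. Unfiltered stop words contribute noise to every dimension here; a realistic setup would presumably use term weighting and a much larger index.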

We present results of an extensive analysis that demonstrates the power of this new retrieval model: for a query document d, the topically most similar documents from a corpus in another language are properly ranked. A salient property of the new retrieval model is its robustness with respect to both the size and the quality of the index document collection.

Editor information

Craig Macdonald, Iadh Ounis, Vassilis Plachouras, Ian Ruthven, Ryen W. White

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Potthast, M., Stein, B., Anderka, M. (2008). A Wikipedia-Based Multilingual Retrieval Model. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds) Advances in Information Retrieval. ECIR 2008. Lecture Notes in Computer Science, vol 4956. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78646-7_51

  • DOI: https://doi.org/10.1007/978-3-540-78646-7_51

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78645-0

  • Online ISBN: 978-3-540-78646-7

  • eBook Packages: Computer Science, Computer Science (R0)
