Application of Variable Length N-Gram Vectors to Monolingual and Bilingual Information Retrieval

Gayo-Avello, Daniel; Álvarez-Gutiérrez, Darío; Gayo-Avello, José

doi:10.1007/11519645_7

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3491)

Included in the following conference series:

Workshop of the Cross-Language Evaluation Forum for European Languages

Abstract

Our group in the Department of Informatics at the University of Oviedo participated for the first time in two tasks at CLEF: monolingual (Russian) and bilingual (Spanish-to-English) information retrieval. Our main goal was to test the application to IR of a modified version of the n-gram vector space model, codenamed blindLight. This approach has already been applied successfully to other NLP tasks, such as language identification and text summarization, and the results achieved at CLEF 2004, although not exceptional, are encouraging. The blindLight approach differs from classical techniques in two major ways: (1) relative frequencies are no longer used as vector weights but are replaced by n-gram significances, and (2) cosine distance is abandoned in favor of a new metric inspired by sequence alignment techniques but less computationally expensive than them. To perform cross-language IR, we developed a naive n-gram pseudo-translator similar to those described by McNamee and Mayfield or Pirkola et al.
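For orientation, the classical baseline that the abstract contrasts blindLight with can be sketched as follows: character n-grams weighted by relative frequency and compared with the cosine measure. This is only an illustration of that classical model, not the blindLight method, whose significance weights and alignment-inspired metric are not specified in this abstract. The function names and the choice of n = 3 are assumptions made for the example.

```python
from collections import Counter
from math import sqrt
from typing import Dict


def ngram_vector(text: str, n: int = 3) -> Dict[str, float]:
    """Map each character n-gram of `text` to its relative frequency.

    Classical weighting only: blindLight replaces these relative
    frequencies with n-gram significances (not reproduced here).
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse n-gram vectors."""
    dot = sum(weight * b[gram] for gram, weight in a.items() if gram in b)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    query = ngram_vector("information retrieval")
    doc_a = ngram_vector("retrieval of information")
    doc_b = ngram_vector("protein biosynthesis")
    # Rank the two hypothetical documents against the query.
    print(cosine_similarity(query, doc_a), cosine_similarity(query, doc_b))
```

Ranking documents then amounts to sorting them by their similarity to the query vector. Character n-grams need no stemming or word segmentation, which is one reason such models are attractive across languages.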

References

Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Communications of the ACM 18(11), 613–620 (1975)
D’Amore, R., Mah, C.P.: One-time complete indexing of text: Theory and practice. In: Proc. of SIGIR 1985, pp. 155–164 (1985)
Kimbrell, R.E.: Searching for text? Send an n-gram! Byte 13(5), 297–312 (1988)
Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1), 61–74 (1993)
Ferreira da Silva, J., Pereira Lopes, G.: A Local Maxima method and a Fair Dispersion Normalization for extracting multi-word units from corpora. In: Proc. of MOL6 (1999)
Ferreira da Silva, J., Pereira Lopes, G.: Extracting Multiword Terms from Document Collections. In: Proc. of VExTAL, Venice, Italy (1999)
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals (English translation from Russian). Soviet Physics Doklady 10(8), 707–710 (1966)
Gayo-Avello, D., Álvarez-Gutiérrez, D., Gayo-Avello, J.: Naive Algorithms for Keyphrase Extraction and Text Summarization from a Single Document Inspired by the Protein Biosynthesis Process. In: Ijspeert, A.J., Murata, M., Wakamiya, N. (eds.) BioADIT 2004. LNCS, vol. 3141, pp. 440–455. Springer, Heidelberg (2004)
Gayo-Avello, D., Álvarez-Gutiérrez, D., Gayo-Avello, J.: One Size Fits All? A Simple Technique to Perform Several NLP Tasks. In: Vicedo, J.L., Martínez-Barco, P., Muñoz, R., Saiz Noeda, M. (eds.) EsTAL 2004. LNCS (LNAI), vol. 3230, pp. 267–278. Springer, Heidelberg (2004)
Peters, C., Braschler, M., Di Nunzio, G., Ferro, N.: CLEF 2004: Ad Hoc Track Overview and Results Analysis. In: Peters, C., Clough, P., Gonzalo, J., Jones, G.J.F., Kluck, M., Magnini, B. (eds.) CLEF 2004. LNCS, vol. 3491, pp. 10–26. Springer, Heidelberg (2005)
Peters, C.: What happened in CLEF 2004. In: Peters, C., Clough, P., Gonzalo, J., Jones, G.J.F., Kluck, M., Magnini, B. (eds.) CLEF 2004. LNCS, vol. 3491, pp. 1–9. Springer, Heidelberg (2005)
Koehn, P.: Europarl: A Multilingual Corpus for Evaluation of Machine Translation, Draft (unpublished), http://www.isi.edu/~koehn/publications/europarl.ps
Pirkola, A., Keskustalo, H., Leppänen, E., Känsälä, A., Järvelin, K.: Targeted s-gram matching: a novel n-gram matching technique for cross- and monolingual word form variants. Information Research 7(2) (2002)
McNamee, P., Mayfield, J.: JHU/APL Experiments in Tokenization and Non-Word Translation. In: Peters, C., Gonzalo, J., Braschler, M., Kluck, M. (eds.) CLEF 2003. LNCS, vol. 3237, pp. 85–97. Springer, Heidelberg (2004)

Author information

Authors and Affiliations

Department of Informatics, University of Oviedo, Calvo Sotelo, s/n, 33007, Oviedo, Spain
Daniel Gayo-Avello, Darío Álvarez-Gutiérrez & José Gayo-Avello

Editor information

Editors and Affiliations

ISTI-CNR, Area di Ricerca, Pisa, Italy
Carol Peters
Sheffield University, Sheffield, United Kingdom
Paul Clough
No affiliation listed
Julio Gonzalo
Centre for Digital Video Processing & School of Computing, Dublin City University, Dublin 9, Ireland
Gareth J. F. Jones
German Institute for International and Security Affairs, Stiftung Wissenschaft und Politik (SWP), Ludwigkirchplatz 3-4, P.O. Box, 10719, Berlin, Germany
Michael Kluck
ITC-IRST, Trento, Italy
Bernardo Magnini

About this paper

Cite this paper

Gayo-Avello, D., Álvarez-Gutiérrez, D., Gayo-Avello, J. (2005). Application of Variable Length N-Gram Vectors to Monolingual and Bilingual Information Retrieval. In: Peters, C., Clough, P., Gonzalo, J., Jones, G.J.F., Kluck, M., Magnini, B. (eds) Multilingual Information Access for Text, Speech and Images. CLEF 2004. Lecture Notes in Computer Science, vol 3491. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11519645_7

DOI: https://doi.org/10.1007/11519645_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-27420-9
Online ISBN: 978-3-540-32051-7
eBook Packages: Computer Science, Computer Science (R0)
