Morphological query expansion and language-filtering words for improving Basque web retrieval

  • Original Paper
Language Resources and Evaluation

Abstract

Users of major search engines and other web information retrieval services who look for information in Basque are poorly served. These services return only pages that contain exact matches of the query terms and miss their inflected forms, a serious limitation for an agglutinative language like Basque. They also return many results in other languages, because no search engine offers the option of restricting results to Basque. This paper proposes combining morphological query expansion and language-filtering words with the APIs of search engines as a very cost-effective way to build appropriate web search services for Basque. The implementation details of the methodology (which language-filtering words are most appropriate and how many of them to use, which inflections are frequent enough to include in the morphological query expansion, etc.) were determined through corpus-based studies. The resulting improvements were measured in terms of precision and recall, both over corpora and in real web searches. Morphological query expansion can improve recall up to 47%, and language-filtering words can raise precision from 15% to around 90%, although with a loss in recall of about 30–35%. The proposed methodology has already been used successfully in the Basque search service Elebila (http://www.elebila.eu) and the web-as-corpus tool CorpEus (http://www.corpeus.org), and the approach could also be applied to other morphologically rich or under-resourced languages.
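To make the two techniques named in the abstract concrete, the following is a minimal sketch of how an expanded and filtered query string might be assembled before it is sent to a search engine API. It is an illustration under stated assumptions, not the authors' implementation: the suffix list, the filter words, the function names, and the `OR`/parentheses query syntax are all hypothetical choices made for this example, and real Basque inflection requires a proper morphological generator rather than plain suffix concatenation.

```python
from typing import Iterable, List

# Hypothetical suffix list for illustration only. The paper selects the most
# frequent inflections through corpus studies; this list is NOT taken from it.
ILLUSTRATIVE_SUFFIXES = ["", "a", "ak", "an", "ko", "ra"]

# Hypothetical language-filtering words (very common Basque words).
# The paper chooses the actual words and their number empirically.
ILLUSTRATIVE_FILTER_WORDS = ["eta", "da", "ez", "ere"]


def expand_term(lemma: str, suffixes: Iterable[str]) -> List[str]:
    """Return the lemma plus naively suffixed forms, without duplicates.

    Simplification: real inflection is not plain concatenation.
    """
    forms: List[str] = []
    for suffix in suffixes:
        form = lemma + suffix
        if form not in forms:
            forms.append(form)
    return forms


def build_query(
    lemmas: Iterable[str],
    suffixes: Iterable[str],
    filter_words: Iterable[str],
    n_filter_words: int,
) -> str:
    """Build one query string: an OR-group per lemma, then the filter words.

    Assumes the target engine treats space-separated terms as a conjunction
    and understands OR with parentheses.
    """
    suffix_list = list(suffixes)
    groups = [
        "(" + " OR ".join(expand_term(lemma, suffix_list)) + ")"
        for lemma in lemmas
    ]
    filters = list(filter_words)[:n_filter_words]
    return " ".join(groups + filters)


if __name__ == "__main__":
    # "etxe" is the Basque lemma for "house".
    print(build_query(["etxe"], ILLUSTRATIVE_SUFFIXES, ILLUSTRATIVE_FILTER_WORDS, 2))
```

The OR-group widens the match to inflected forms (the recall side of the trade-off), while each added filter word requires a result page to contain one more common Basque word (the precision side, at some cost in recall). The `n_filter_words` parameter exposes that trade-off as a single knob.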

Author information

Corresponding author

Correspondence to Igor Leturia.

About this article

Cite this article

Leturia, I., Gurrutxaga, A., Areta, N. et al. Morphological query expansion and language-filtering words for improving Basque web retrieval. Lang Resources & Evaluation 47, 425–448 (2013). https://doi.org/10.1007/s10579-012-9208-x

  • DOI: https://doi.org/10.1007/s10579-012-9208-x
