Morphological query expansion and language-filtering words for improving Basque web retrieval

Leturia, Igor; Gurrutxaga, Antton; Areta, Nerea; Alegria, Iñaki; Ezeiza, Aitzol

doi:10.1007/s10579-012-9208-x

Original Paper
Published: 04 December 2012

Language Resources and Evaluation, Volume 47, pages 425–448 (2013)

Igor Leturia¹, Antton Gurrutxaga¹, Nerea Areta¹, Iñaki Alegria² & Aitzol Ezeiza²

Abstract

Users who search for Basque-language information through major search engines or other web information retrieval services are poorly served. Among other problems, these services return only pages that match the query terms exactly, missing inflected forms (a serious limitation for an agglutinative language like Basque), and they return many results in other languages, since no search engine offers an option to restrict results to Basque. This paper proposes combining morphological query expansion and language-filtering words with the APIs of search engines as a very cost-effective way to build suitable web search services for Basque. The implementation details of the methodology (which language-filtering words to use and how many, which inflections are frequent enough to include in the morphological query expansion, etc.) were determined through corpus-based studies. The resulting improvements were measured in terms of precision and recall, both over corpora and in real web searches. Morphological query expansion can improve recall up to 47%, and language-filtering words can raise precision from 15% to around 90%, although with a loss in recall of about 30–35%. The methodology has already been used successfully in the Basque search service Elebila (http://www.elebila.eu) and the web-as-corpus tool CorpEus (http://www.corpeus.org), and the approach could also be applied to other morphologically rich or under-resourced languages.
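
The two techniques named in the abstract can be pictured as a query-rewriting step that runs before a search engine API is called. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the Boolean `OR`/`AND` query syntax, the suffix list and the filter words (`eta`, `da`) are all illustrative choices, and the naive suffix concatenation stands in for a real morphological generator, which Basque would need in practice.

```python
from typing import Iterable, List


def expand_morphologically(lemma: str, suffixes: Iterable[str]) -> List[str]:
    """Return the lemma plus inflected forms built by naive suffix concatenation.

    Assumption: a real system would call a morphological generator here;
    plain concatenation only works for simple stems such as "etxe".
    """
    forms = [lemma]
    for suffix in suffixes:
        form = lemma + suffix
        if form not in forms:  # avoid duplicate query terms
            forms.append(form)
    return forms


def build_query(
    lemmas: Iterable[str],
    suffixes: Iterable[str],
    filter_words: Iterable[str],
) -> str:
    """Build one Boolean query string.

    Each lemma becomes an OR-group of its inflected forms (query expansion);
    the language-filtering words are ANDed on, so that only pages containing
    those frequent words of the target language can match.
    """
    suffixes = list(suffixes)
    groups = [
        "(" + " OR ".join(expand_morphologically(lemma, suffixes)) + ")"
        for lemma in lemmas
    ]
    return " AND ".join(groups + list(filter_words))


if __name__ == "__main__":
    # Hypothetical inputs: "etxe" (house) with three case/number suffixes,
    # and two very frequent Basque words used as language filters.
    print(build_query(["etxe"], ["a", "ak", "an"], ["eta", "da"]))
```

The sketch also makes the trade-off reported in the abstract visible: every filter word added to the query excludes pages that lack it, which is why precision rises while some recall is lost.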

Author information

Authors and Affiliations

Elhuyar Foundation, Usurbil, Gipuzkoa, Spain
Igor Leturia, Antton Gurrutxaga & Nerea Areta
University of the Basque Country, Donostia/San Sebastian, Gipuzkoa, Spain
Iñaki Alegria & Aitzol Ezeiza

Corresponding author

Correspondence to Igor Leturia.

About this article

Cite this article

Leturia, I., Gurrutxaga, A., Areta, N. et al. Morphological query expansion and language-filtering words for improving Basque web retrieval. Lang Resources & Evaluation 47, 425–448 (2013). https://doi.org/10.1007/s10579-012-9208-x

Issue Date: June 2013
