Abstract
Recent years have witnessed considerable advances in information retrieval for European languages other than English. We give an overview of commonly used techniques and analyze their impact on retrieval effectiveness. The techniques considered range from linguistically motivated ones, such as morphological normalization and compound splitting, to knowledge-free approaches, such as n-gram indexing. Evaluations are carried out against data from the CLEF campaign, covering eight European languages. Our results show that for many of these languages a modicum of linguistic techniques may lead to improvements in retrieval effectiveness, as can the use of language-independent techniques.
Cite this article
Hollink, V., Kamps, J., Monz, C. et al. Monolingual Document Retrieval for European Languages. Information Retrieval 7, 33–52 (2004). https://doi.org/10.1023/B:INRT.0000009439.19151.4c