Abstract
XML retrieval departs from standard document retrieval in that each individual XML element, ranging from italicized words or phrases to full-blown articles, is a retrievable unit. The distribution of XML element lengths is unlike what we usually observe in standard document collections, prompting us to revisit the issue of document length normalization. We perform a comparative analysis of arbitrary elements versus relevant elements, and show the importance of element length as a parameter for XML retrieval. Within the language modeling framework, we investigate a range of techniques that deal with length either directly or indirectly. We observe a length bias introduced by the amount of smoothing, and show the importance of an extreme length bias for XML retrieval. We also show that simply removing shorter elements from the index (by introducing a cut-off value) does not create an appropriate element length normalization. Even after restricting the minimum size of XML elements occurring in the index, the importance of an extreme explicit length bias remains.
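To make the notion of an explicit length bias concrete, the following is a minimal sketch, not the method used in the article. It assumes a query-likelihood language model with linear-interpolation (Jelinek-Mercer) smoothing and a length prior proportional to the element length raised to a power beta. The function name `score`, the parameters `lam` and `beta`, and the toy data are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def score(query, element, collection, lam=0.15, beta=1.0):
    """Log query likelihood of a non-empty element, plus a log length prior.

    Assumptions (not taken from the article):
      lam  -- interpolation weight on the element model
      beta -- exponent of the length prior |e|^beta; beta=0 disables the prior
    All arguments except lam and beta are lists of tokens.
    """
    elem_tf = Counter(element)
    coll_tf = Counter(collection)
    elem_len = len(element)
    coll_len = len(collection)

    # Explicit length bias: longer elements receive a higher prior.
    log_p = beta * math.log(elem_len)

    for term in query:
        p_elem = elem_tf[term] / elem_len      # element language model
        p_coll = coll_tf[term] / coll_len      # background collection model
        p = lam * p_elem + (1.0 - lam) * p_coll
        if p == 0.0:
            return float("-inf")               # term unseen in the collection
        log_p += math.log(p)
    return log_p


# Toy data: a one-word element (e.g. an italicized word) and a longer section.
short_elem = ["xml"]
long_elem = ("xml retrieval treats each element as a retrievable unit "
             "and xml element length matters").split()
collection = short_elem + long_elem
query = ["xml"]

for beta in (0.0, 1.0):
    ranked = sorted(
        [("short", score(query, short_elem, collection, beta=beta)),
         ("long", score(query, long_elem, collection, beta=beta))],
        key=lambda pair: pair[1],
        reverse=True,
    )
    print(beta, [name for name, _ in ranked])
```

In this toy setting the one-word element ranks first without a prior, and the longer element ranks first once the prior is switched on. This only illustrates the mechanism; it says nothing about which settings work well on a real collection.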
Cite this article
Kamps, J., Rijke, M.d. & Sigurbjörnsson, B. The Importance of Length Normalization for XML Retrieval. Inf Retrieval 8, 631–654 (2005). https://doi.org/10.1007/s10791-005-0750-7
DOI: https://doi.org/10.1007/s10791-005-0750-7