Abstract
Content-oriented XML retrieval systems support access to XML repositories by retrieving, in response to user queries, XML document components (XML elements) instead of whole documents. The retrieved XML elements should not only contain information relevant to the query, but also provide the right level of granularity. In INEX, the INitiative for the Evaluation of XML retrieval, a relevant element is defined to be at the right level of granularity if it is exhaustive and specific to the query. Specificity was introduced to capture how focused an element is on the query (i.e., the extent to which it discusses no topics irrelevant to the query). To score XML elements according to how exhaustive and specific they are given a query, the content and logical structure of XML documents have been widely used. One source of evidence that has led to promising results with respect to retrieval effectiveness is element length. This work examines a new source of evidence derived from the semantic decomposition of XML documents. We consider that XML documents can be semantically decomposed through the application of a topic segmentation algorithm. Using the semantic decomposition and the logical structure of XML documents, we propose a new source of evidence, the number of topic shifts in an element, to reflect its relevance and, more particularly, its specificity. This paper has three research objectives. Firstly, we investigate the characteristics of XML elements reflected by their number of topic shifts. Secondly, we compare topic shifts to element length by incorporating each of them as a feature in a retrieval setting and examining their effects in estimating the relevance of XML elements given a query. Finally, we use the number of topic shifts as evidence for capturing specificity in order to provide focused access to XML repositories.
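As a rough illustration of the proposed evidence, the following Python sketch counts the topic shifts in an element by combining the output of a topic segmentation step with the element's position in the document. The abstract does not state the exact counting rule, so everything here is an assumption made for illustration: the `Element` class, the use of word offsets, and the rule that a shift is counted only when a segment boundary falls strictly inside the element's span. It is not the authors' definition.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical view of an XML element as a word-offset span."""
    path: str   # e.g. an XPath-like identifier
    start: int  # offset of the first word (inclusive)
    end: int    # offset one past the last word (exclusive)


def count_topic_shifts(element, boundaries):
    """Count segment boundaries lying strictly inside the element's span.

    `boundaries` is assumed to be a sorted list of word offsets at which
    some topic segmentation algorithm starts a new segment.
    """
    lo = bisect_right(boundaries, element.start)  # first boundary > start
    hi = bisect_left(boundaries, element.end)     # first boundary >= end
    return max(0, hi - lo)


# Assumed segmentation output for one document: new segments start here.
boundaries = [120, 300, 450]

article = Element("/article[1]", 0, 600)
section = Element("/article[1]/sec[1]", 100, 320)
paragraph = Element("/article[1]/sec[1]/p[2]", 120, 200)

shifts = {e.path: count_topic_shifts(e, boundaries)
          for e in (article, section, paragraph)}
```

Under these assumptions, an element that spans several segments (such as the whole article) receives a higher count than a paragraph that sits within a single segment, which matches the intuition that fewer topic shifts suggest a more focused, and hence more specific, element.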
Cite this article
Ashoori, E., Lalmas, M. & Tsikrika, T. Examining topic shifts in content-oriented XML retrieval. Int J Digit Libr 8, 39–60 (2007). https://doi.org/10.1007/s00799-007-0026-5
DOI: https://doi.org/10.1007/s00799-007-0026-5