Examining topic shifts in content-oriented XML retrieval

  • Regular Paper
  • International Journal on Digital Libraries

Abstract

Content-oriented XML retrieval systems support access to XML repositories by retrieving, in response to user queries, XML document components (XML elements) instead of whole documents. The retrieved XML elements should not only contain information relevant to the query, but also provide the right level of granularity. In INEX, the INitiative for the Evaluation of XML retrieval, a relevant element is defined to be at the right level of granularity if it is both exhaustive and specific with respect to the query. Specificity was introduced to capture how focused an element is on the query (i.e., whether it avoids discussing topics irrelevant to the query). To score XML elements according to how exhaustive and specific they are given a query, the content and logical structure of XML documents have been widely used. One source of evidence that has led to promising results with respect to retrieval effectiveness is element length. This work examines a new source of evidence derived from the semantic decomposition of XML documents. We consider that XML documents can be semantically decomposed through the application of a topic segmentation algorithm. Using the semantic decomposition and the logical structure of XML documents, we propose a new source of evidence, the number of topic shifts in an element, to reflect its relevance and, more particularly, its specificity. This paper has three research objectives. Firstly, we investigate the characteristics of XML elements reflected by their number of topic shifts. Secondly, we compare topic shifts to element length by incorporating each of them as a feature in a retrieval setting and examining their effects on estimating the relevance of XML elements given a query. Finally, we use the number of topic shifts as evidence for capturing specificity, in order to provide focused access to XML repositories.
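
As a rough illustration of the central quantity, the sketch below counts topic shifts inside an element once a topic segmentation algorithm has placed boundaries in a document's text. It is only a minimal sketch under stated assumptions: the `Span` type, the `count_topic_shifts` function, the use of character offsets, and the rule that a shift is a boundary falling strictly inside the element are all hypothetical choices made for this example, not the definition used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Half-open character range [start, end) within a document's text."""
    start: int
    end: int


def count_topic_shifts(element: Span, boundaries: list[int]) -> int:
    """Count topic boundaries that fall strictly inside the element.

    `boundaries` are character offsets at which a (hypothetical) topic
    segmentation algorithm placed a break between two topic segments.
    A boundary at the element's first or last offset is not counted,
    because no change of topic happens within the element's own text.
    This counting rule is an assumption made for illustration.
    """
    return sum(1 for b in boundaries if element.start < b < element.end)


# Toy document of 1000 characters, segmented into four topic segments.
boundaries = [250, 600, 820]

article = Span(0, 1000)      # whole document
section = Span(200, 650)     # crosses two boundaries
paragraph = Span(300, 400)   # lies inside a single topic segment

shifts = {
    "article": count_topic_shifts(article, boundaries),
    "section": count_topic_shifts(section, boundaries),
    "paragraph": count_topic_shifts(paragraph, boundaries),
}
```

Under this assumed rule, an element that stays within one topic segment has no shifts, while larger elements that cross several segments accumulate more, which is the intuition behind using the count as evidence for specificity.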

Author information

Corresponding author

Correspondence to Mounia Lalmas.

About this article

Cite this article

Ashoori, E., Lalmas, M. & Tsikrika, T. Examining topic shifts in content-oriented XML retrieval. Int J Digit Libr 8, 39–60 (2007). https://doi.org/10.1007/s00799-007-0026-5

  • DOI: https://doi.org/10.1007/s00799-007-0026-5
