Abstract
Most recent document standards, such as XML, rely on structured representations. Current information retrieval systems, however, were developed for flat document representations and cannot easily be extended to cope with more complex document types. The design of such systems is still an open problem. We present a new model for structured document retrieval that computes scores for individual document parts. The model is based on Bayesian networks whose conditional probabilities are learnt from a labelled collection of structured documents, composed of documents, queries and their associated assessments. Training these models is a complex, non-standard machine learning task, and it is the focus of this paper: we propose to train the structured Bayesian network model using a cross-entropy training criterion. Results are presented on the INEX corpus of XML documents.
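To make the cross-entropy training criterion concrete, the following minimal sketch computes the average cross-entropy between assessed labels and the probabilities a model assigns to document parts. It is an illustration only, not the model described in the paper: the function name `cross_entropy`, the two labels `relevant` and `irrelevant`, and the probability values are all assumptions chosen for the example.

```python
import math


def cross_entropy(predicted, assessed):
    """Average cross-entropy between assessed labels and predicted distributions.

    predicted: list of dicts mapping a label to its predicted probability,
               one dict per document part (hypothetical model output).
    assessed:  list of assessed labels, one per document part.
    """
    eps = 1e-12  # guard against log(0) when a label gets zero probability
    total = 0.0
    for dist, label in zip(predicted, assessed):
        # Penalise the model by the negative log-probability of the assessed label.
        total -= math.log(max(dist.get(label, 0.0), eps))
    return total / len(assessed)


# Hypothetical predictions for three document parts
# (for instance an article, a section and a paragraph).
predicted = [
    {"relevant": 0.8, "irrelevant": 0.2},
    {"relevant": 0.3, "irrelevant": 0.7},
    {"relevant": 0.6, "irrelevant": 0.4},
]
assessed = ["relevant", "irrelevant", "relevant"]

loss = cross_entropy(predicted, assessed)
print(f"cross-entropy: {loss:.4f}")
```

Lower values mean that the predicted distributions put more probability mass on the assessed labels. In a training setting the conditional probabilities of the network would be adjusted to reduce this quantity over the labelled collection.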
Cite this article
Piwowarski, B., Gallinari, P. A Bayesian Framework for XML Information Retrieval: Searching and Learning with the INEX Collection. Inf Retrieval 8, 655–681 (2005). https://doi.org/10.1007/s10791-005-0751-6
DOI: https://doi.org/10.1007/s10791-005-0751-6