Abstract
Most recent document standards rely on structured representations. However, current information retrieval systems were developed for flat document representations and cannot easily be extended to cope with more complex document types. Only a few models have been proposed for handling structured documents, and the design of such systems remains an open problem. We present a new model for structured document retrieval that computes and combines the scores of document parts. It is based on Bayesian networks and allows the model parameters to be learned in the presence of incomplete data. We present an application of this model to ad hoc retrieval and evaluate its performance on a small structured collection. The model can also be extended to other tasks, such as interactive navigation in structured documents or corpora.
© 2003 Springer-Verlag Berlin Heidelberg
Cite this paper
Piwowarski, B., Gallinari, P. (2003). A Machine Learning Model for Information Retrieval with Structured Documents. In: Perner, P., Rosenfeld, A. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2003. Lecture Notes in Computer Science, vol 2734. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45065-3_37
DOI: https://doi.org/10.1007/3-540-45065-3_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40504-7
Online ISBN: 978-3-540-45065-8
eBook Packages: Springer Book Archive