A Machine Learning Model for Information Retrieval with Structured Documents

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2003)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2734)

Abstract

Most recent document standards rely on structured representations. Current information retrieval systems, however, were developed for flat document representations and cannot easily be extended to cope with more complex document types. Only a few models have been proposed for handling structured documents, and the design of such systems remains an open problem. We present a new model for structured document retrieval that computes and combines the scores of document parts. It is based on Bayesian networks and allows the model parameters to be learned in the presence of incomplete data. We present an application of this model to ad hoc retrieval and evaluate its performance on a small structured collection. The model can also be extended to other tasks, such as interactive navigation in structured documents or corpora.
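
As a rough illustration of what "combining the scores of document parts" can mean, the sketch below propagates relevance scores up a small document tree with a noisy-OR rule, a common conditional distribution in Bayesian networks. It is not the model of the paper: the `Part` class, the `combined_score` function, the fixed `weight`, and all numeric scores are assumptions made for this example. In the approach summarized above, such parameters would be learned from data instead of being set by hand.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Part:
    """A node of a document tree (hypothetical structure, for illustration only)."""
    name: str
    local_score: float  # assumed probability in [0, 1] that this part's own text matches the query
    children: List["Part"] = field(default_factory=list)


def combined_score(part: Part, weight: float = 0.5) -> float:
    """Noisy-OR combination of a part's own score with the scores of its children.

    The part is treated as relevant if its own text is relevant, or if at
    least one child is relevant and "activates" the parent with probability
    `weight` (an assumed, hand-set parameter).
    """
    p_not_relevant = 1.0 - part.local_score
    for child in part.children:
        p_not_relevant *= 1.0 - weight * combined_score(child, weight)
    return 1.0 - p_not_relevant


# A toy document: one strongly matching paragraph inside section-1.
doc = Part("article", 0.1, [
    Part("section-1", 0.2, [Part("paragraph-1", 0.8)]),
    Part("section-2", 0.0),
])

for node in (doc.children[0].children[0], doc.children[0], doc):
    print(f"{node.name}: {combined_score(node):.3f}")
```

With these made-up numbers the paragraph scores highest, its enclosing section lower, and the whole article lower still, so a ranking over parts would return the paragraph first. That is the kind of behavior a structured retrieval system aims for when a query is answered by a small part of a long document.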

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Piwowarski, B., Gallinari, P. (2003). A Machine Learning Model for Information Retrieval with Structured Documents. In: Perner, P., Rosenfeld, A. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2003. Lecture Notes in Computer Science, vol 2734. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45065-3_37

  • DOI: https://doi.org/10.1007/3-540-45065-3_37

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40504-7

  • Online ISBN: 978-3-540-45065-8

  • eBook Packages: Springer Book Archive
