A Belief Networks-Based Generative Model for Structured Documents. An Application to the XML Categorization

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2003)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2734)

Abstract

We present a generative Bayesian model for structured (e.g., XML) documents. The model takes structure and content information into account simultaneously, and it is used here to classify XML documents. We adopt a machine learning approach: the model parameters are learned from a labeled training set of representative documents. We discuss the role of structural information in classification and describe experiments on a small collection of class-labeled structured documents. We also present preliminary results showing how this model could classify documents whose DTDs are not represented in the training set.
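
To make the idea of combining structure and content concrete, here is a minimal sketch. It is not the authors' belief network, whose details the abstract does not give. Instead it assumes a much simpler stand-in: a naive Bayes classifier in which each word's probability is conditioned on both the class and the tag of the enclosing XML element. The class name `StructuredNaiveBayes`, the helper `tag_word_pairs`, and the toy documents are all hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def tag_word_pairs(xml_text):
    """Yield (tag, word) pairs, pairing each word with its enclosing element's tag."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Only the text directly inside the element is used; tail text is ignored.
        for word in (elem.text or "").lower().split():
            yield elem.tag, word


class StructuredNaiveBayes:
    """Illustrative stand-in: P(word | tag, class) with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.pair_counts = defaultdict(Counter)  # class -> (tag, word) -> count
        self.tag_totals = defaultdict(Counter)   # class -> tag -> word count
        self.vocab = set()

    def fit(self, docs, labels):
        for xml_text, label in zip(docs, labels):
            self.class_counts[label] += 1
            for tag, word in tag_word_pairs(xml_text):
                self.pair_counts[label][(tag, word)] += 1
                self.tag_totals[label][tag] += 1
                self.vocab.add(word)
        return self

    def log_score(self, xml_text, label):
        """Log prior plus the sum of smoothed log P(word | tag, class)."""
        n_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / n_docs)
        vocab_size = len(self.vocab)
        for tag, word in tag_word_pairs(xml_text):
            if word not in self.vocab:
                continue  # skip words never seen in training
            num = self.pair_counts[label][(tag, word)] + self.alpha
            den = self.tag_totals[label][tag] + self.alpha * vocab_size
            score += math.log(num / den)
        return score

    def predict(self, xml_text):
        return max(self.class_counts, key=lambda c: self.log_score(xml_text, c))


# Toy training set (invented for illustration only).
train_docs = [
    "<doc><title>football match</title><body>the team won the game</body></doc>",
    "<doc><title>market report</title><body>the bank raised rates</body></doc>",
]
train_labels = ["sports", "finance"]
model = StructuredNaiveBayes().fit(train_docs, train_labels)
prediction = model.predict("<doc><title>football</title><body>the game</body></doc>")
```

Because the word counts are keyed by tag, the same word can carry different weight in a `title` than in a `body`. That is one simple way for structure to influence a content-based decision. This sketch does not model tags it has never seen, so it says nothing about the unseen-DTD setting mentioned in the abstract.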

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Denoyer, L., Gallinari, P. (2003). A Belief Networks-Based Generative Model for Structured Documents. An Application to the XML Categorization. In: Perner, P., Rosenfeld, A. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2003. Lecture Notes in Computer Science, vol 2734. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45065-3_29

  • DOI: https://doi.org/10.1007/3-540-45065-3_29

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40504-7

  • Online ISBN: 978-3-540-45065-8

  • eBook Packages: Springer Book Archive