A Belief Networks-Based Generative Model for Structured Documents. An Application to the XML Categorization

Denoyer, Ludovic; Gallinari, Patrick

doi:10.1007/3-540-45065-3_29

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2734)

Included in the following conference series:

International Workshop on Machine Learning and Data Mining in Pattern Recognition

Abstract

We present a generative Bayesian model for structured documents (e.g. XML). The model takes structure and content information into account simultaneously, and is used here to classify XML documents. We adopt a machine learning approach: the model parameters are learned from a labeled training set of representative documents. We discuss the role of structural information in classification and describe experiments on a small collection of class-labeled structured documents. We also present preliminary results showing how the model could classify documents whose DTDs are not represented in the training set.
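
To make the idea of scoring structure and content jointly more concrete, here is a minimal sketch in Python. It is not the model from the paper, whose details are not given in this abstract. It assumes a simple factorization in which each tag is generated given its parent tag and each word is generated given its enclosing tag, both conditioned on the class, with add-alpha smoothing. The names `events` and `StructuredNB` and the parameter `alpha` are hypothetical, chosen only for illustration.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def events(xml_text):
    """Yield ("tag", parent_tag, tag) and ("word", tag, word) events for one document.

    Simplification (assumption): only text directly inside an element is used;
    tail text after child elements is ignored.
    """
    root = ET.fromstring(xml_text)
    stack = [("<root>", root)]
    while stack:
        parent_tag, node = stack.pop()
        yield ("tag", parent_tag, node.tag)           # structure event
        for word in (node.text or "").lower().split():
            yield ("word", node.tag, word)            # content event
        for child in node:
            stack.append((node.tag, child))


class StructuredNB:
    """Illustrative generative classifier over structure and content events."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                            # smoothing constant
        self.class_counts = Counter()
        self.event_counts = defaultdict(Counter)      # class -> (kind, ctx, value) -> n
        self.context_counts = defaultdict(Counter)    # class -> (kind, ctx) -> n
        self.vocab = {"tag": set(), "word": set()}

    def fit(self, docs, labels):
        for doc, label in zip(docs, labels):
            self.class_counts[label] += 1
            for kind, ctx, val in events(doc):
                self.event_counts[label][(kind, ctx, val)] += 1
                self.context_counts[label][(kind, ctx)] += 1
                self.vocab[kind].add(val)
        return self

    def log_joint(self, doc, label):
        """log P(class) + sum of smoothed log P(event | context, class)."""
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total)
        for kind, ctx, val in events(doc):
            num = self.event_counts[label][(kind, ctx, val)] + self.alpha
            # +1 reserves probability mass for values never seen in training
            den = (self.context_counts[label][(kind, ctx)]
                   + self.alpha * (len(self.vocab[kind]) + 1))
            score += math.log(num / den)
        return score

    def predict(self, doc):
        return max(self.class_counts, key=lambda c: self.log_joint(doc, c))
```

Because words are conditioned on their enclosing tag, the same word can carry different weight in a title than in a body, which is one simple way structure can influence a content-based decision. Because unseen tags still receive smoothed probability, a document with an unfamiliar tag set can be scored, although how the paper handles unseen DTDs is not described in this abstract.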

Author information

Authors and Affiliations

Laboratoire d’Informatique de Paris VI, LIP6, France
Ludovic Denoyer & Patrick Gallinari

Editor information

Editors and Affiliations

Institute of Computer Vision and Applied Computer Sciences, Arndtstr. 4, 04275, Leipzig, Germany
Petra Perner
Center for Automation Research, University of Maryland, College Park, Maryland, 20742-3275, USA
Azriel Rosenfeld

About this paper

Cite this paper

Denoyer, L., Gallinari, P. (2003). A Belief Networks-Based Generative Model for Structured Documents. An Application to the XML Categorization. In: Perner, P., Rosenfeld, A. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2003. Lecture Notes in Computer Science, vol 2734. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45065-3_29

DOI: https://doi.org/10.1007/3-540-45065-3_29
Published: 24 June 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40504-7
Online ISBN: 978-3-540-45065-8
eBook Packages: Springer Book Archive
