Abstract
We present a generative Bayesian model for structured documents such as XML. The model takes structure and content information into account simultaneously, and we use it here to classify XML documents. We adopt a machine learning approach: the model parameters are learned from a labeled training set of representative documents. We discuss the role of structural information in classification and describe experiments on a small collection of class-labeled structured documents. We also present preliminary results showing how the model could classify documents whose DTDs are not represented in the training set.
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Denoyer, L., Gallinari, P. (2003). A Belief Networks-Based Generative Model for Structured Documents. An Application to the XML Categorization. In: Perner, P., Rosenfeld, A. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2003. Lecture Notes in Computer Science, vol 2734. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45065-3_29
DOI: https://doi.org/10.1007/3-540-45065-3_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40504-7
Online ISBN: 978-3-540-45065-8
eBook Packages: Springer Book Archive