Abstract
Supervised classification aims to learn a model (or classifier) from a collection of XML documents, each marked with one of a predefined set of class labels. The learnt classifier isolates each class by the content and structural regularities observed in its labeled XML documents and thus allows the unknown class of an unlabeled XML document to be predicted from its content and structural features. Classifying unlabeled XML documents into the predefined classes is valuable support for more effective and efficient XML search, retrieval and filtering.
We discuss an approach for learning intelligible XML classifiers. XML documents are represented as transactions in a space of Boolean features that are informative of their content and structure. From this transactional representation, learning algorithms induce compact associative classifiers with superior effectiveness. A preprocessing step helps the approach scale with the size of XML corpora.
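The idea of a transactional representation feeding an associative classifier can be illustrated with a minimal sketch. It is not the chapter's algorithm: the abstract does not specify the feature definition or the rule learner, so the sketch assumes that a Boolean feature is a (root-to-node tag path, term) pair and that rules have a single feature as antecedent, ranked by confidence and then support. All function names, thresholds and sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def to_transaction(xml_text):
    """Map an XML document to a set of Boolean features (a transaction).

    Assumption: a feature is a (tag path, term) pair, so the same term
    under two different paths yields two different features.
    """
    items = set()

    def walk(node, path):
        path = path + "/" + node.tag
        for term in (node.text or "").lower().split():
            items.add((path, term))
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return frozenset(items)


def learn_rules(labeled_docs, min_support=2, min_confidence=0.8):
    """Induce single-feature rules 'feature -> class' (a simplification)."""
    item_count = Counter()
    item_class_count = Counter()
    class_count = Counter()
    for xml_text, label in labeled_docs:
        class_count[label] += 1
        for item in to_transaction(xml_text):
            item_count[item] += 1
            item_class_count[(item, label)] += 1

    rules = []
    for (item, label), n in item_class_count.items():
        confidence = n / item_count[item]
        if n >= min_support and confidence >= min_confidence:
            rules.append((confidence, n, item, label))
    # Highest confidence first, then highest support; ties broken by feature.
    rules.sort(key=lambda r: (-r[0], -r[1], r[2], r[3]))

    # Default class: the most frequent label (alphabetical on ties).
    default = max(sorted(class_count), key=class_count.get)
    return rules, default


def classify(xml_text, rules, default):
    """Predict with the first (best-ranked) rule whose feature is present."""
    items = to_transaction(xml_text)
    for _confidence, _support, item, label in rules:
        if item in items:
            return label
    return default


# Hypothetical toy corpus, for illustration only.
TRAIN = [
    ("<paper><title>xml mining</title><body>trees</body></paper>", "research"),
    ("<paper><title>xml classifiers</title><body>rules</body></paper>", "research"),
    ("<post><title>xml tips</title><body>hello</body></post>", "blog"),
    ("<post><title>cooking tips</title><body>hello</body></post>", "blog"),
]

rules, default = learn_rules(TRAIN)
unseen = "<post><title>xml tips</title><body>bye</body></post>"
print(classify(unseen, rules, default))
```

In this toy corpus the term "xml" occurs in both classes, yet the rule learnt for it is tied to the path `/paper/title`, so it does not fire on the unseen `<post>` document. That is the sense in which features combine content with structure. Each rule is also readable on its own, which is one way a classifier of this kind can be intelligible.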
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Costa, G., Ortale, R., Ritacco, E. (2013). Learning Effective XML Classifiers Based on Discriminatory Structures and Nested Content. In: Fred, A., Dietz, J.L.G., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2011. Communications in Computer and Information Science, vol 348. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37186-8_10
DOI: https://doi.org/10.1007/978-3-642-37186-8_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37185-1
Online ISBN: 978-3-642-37186-8
eBook Packages: Computer Science, Computer Science (R0)