Learning Effective XML Classifiers Based on Discriminatory Structures and Nested Content

Costa, Gianni; Ortale, Riccardo; Ritacco, Ettore

doi:10.1007/978-3-642-37186-8_10

Part of the book series: Communications in Computer and Information Science (CCIS, volume 348)

Included in the following conference series:

International Joint Conference on Knowledge Discovery, Knowledge Engineering, and Knowledge Management

Abstract

Supervised classification aims to learn a model (or a classifier) from a collection of XML documents, each marked with one of a predefined set of class labels. The learnt classifier characterizes each class by the content and structural regularities observed in its labeled XML documents and can therefore predict the unknown class of unlabeled XML documents from their content and structural features. Classifying unlabeled XML documents into the predefined classes is a valuable support for more effective and efficient XML search, retrieval and filtering.

We discuss an approach for learning intelligible XML classifiers. XML documents are represented as transactions in a space of boolean features that are informative of their content and structure. From this transactional representation, learning algorithms induce compact associative classifiers of superior effectiveness. A preprocessing step helps the approach scale with the size of XML corpora.
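
The idea of a transactional representation can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not specify how features are built or how rules are mined, so the feature scheme (root-to-node tag paths, plus path and term pairs), the single-feature rules, the thresholds, and all function names (`xml_to_transaction`, `mine_rules`, `classify`) are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def xml_to_transaction(xml_text):
    """Return a set of boolean features for one XML document.

    Hypothetical feature scheme: every root-to-node tag path is a
    structural feature, and every (path, lowercase term) pair found in a
    node's own text is a content feature.
    """
    features = set()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        features.add(path)
        for term in (node.text or "").lower().split():
            features.add(f"{path}#{term}")
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return features


def mine_rules(transactions, labels, min_support=0.3, min_confidence=0.8):
    """Mine single-feature rules 'feature -> class'.

    A toy stand-in for class association rule mining, where real
    associative classifiers use multi-feature antecedents.
    """
    n = len(transactions)
    feature_count = Counter()
    joint_count = defaultdict(Counter)
    for features, label in zip(transactions, labels):
        for f in features:
            feature_count[f] += 1
            joint_count[f][label] += 1
    rules = []
    for f, per_class in joint_count.items():
        label, hits = per_class.most_common(1)[0]
        support = hits / n
        confidence = hits / feature_count[f]
        if support >= min_support and confidence >= min_confidence:
            rules.append((f, label, support, confidence))
    # Rank by confidence, then support, then feature name for determinism.
    rules.sort(key=lambda r: (-r[3], -r[2], r[0]))
    return rules


def classify(features, rules, default):
    """Predict with the best-ranked rule whose feature is present."""
    for f, label, _, _ in rules:
        if f in features:
            return label
    return default


# Tiny made-up corpus, used only to exercise the functions above.
docs = [
    ("<article><title>xml mining</title><body>trees</body></article>", "paper"),
    ("<article><title>rule learning</title></article>", "paper"),
    ("<recipe><title>pasta</title><step>boil</step></recipe>", "recipe"),
    ("<recipe><title>soup</title><step>simmer</step></recipe>", "recipe"),
]
transactions = [xml_to_transaction(x) for x, _ in docs]
labels = [y for _, y in docs]
rules = mine_rules(transactions, labels)
```

Each resulting rule is a human-readable pair of a feature and a class, which is one sense in which an associative classifier can be called intelligible. An unseen document is classified by converting it to a transaction and calling `classify(xml_to_transaction(text), rules, default="unknown")`.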

Author information

Authors and Affiliations

ICAR-CNR, I87036, Rende, CS, Italy
Gianni Costa, Riccardo Ortale & Ettore Ritacco

Editor information

Editors and Affiliations

IST - Technical University of Lisbon, Av.Rovisco Pais, 1, 1049-001, Lisbon, Portugal
Ana Fred
Delft University of Technology, Mekelweg 4, 2628 CD, Delft, The Netherlands
Jan L. G. Dietz
Informatics Research Centre, Henley Business School, University of Reading, RG6 6UD, Reading, UK
Kecheng Liu
INSTICC and IPS, Estefanilha, Setúbal, Portugal
Joaquim Filipe

About this paper

Cite this paper

Costa, G., Ortale, R., Ritacco, E. (2013). Learning Effective XML Classifiers Based on Discriminatory Structures and Nested Content. In: Fred, A., Dietz, J.L.G., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2011. Communications in Computer and Information Science, vol 348. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37186-8_10

DOI: https://doi.org/10.1007/978-3-642-37186-8_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37185-1
Online ISBN: 978-3-642-37186-8
eBook Packages: Computer Science, Computer Science (R0)