Learning Effective XML Classifiers Based on Discriminatory Structures and Nested Content

  • Conference paper
Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2011)

Abstract

Supervised classification aims to learn a model (or classifier) from a collection of XML documents, each marked with one of a predefined set of class labels. The learnt classifier isolates each class by the content and structural regularities observed within its labeled XML documents and can thus predict the unknown class of unlabeled XML documents from their content and structural features. Classifying unlabeled XML documents into the predefined classes is valuable support for more effective and efficient XML search, retrieval and filtering.

We discuss an approach for learning intelligible XML classifiers. XML documents are represented as transactions in a space of boolean features that are informative of their content and structure. From this transactional representation, learning algorithms induce compact and highly effective associative classifiers. A preprocessing step helps the approach scale with the size of XML corpora.
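As a rough illustration of the transactional representation mentioned above, the following sketch turns one XML document into a set of boolean features and checks whether the antecedent of a class association rule holds for it. The abstract does not define the paper's features, so the feature definitions here (root-to-node tag paths for structure, path-and-term pairs for nested content) and the names `xml_to_transaction` and `rule_matches` are assumptions made for illustration only, not the authors' method.

```python
import xml.etree.ElementTree as ET


def xml_to_transaction(xml_text):
    """Return the set of boolean features (items) that hold for one XML document.

    Assumed feature space (hypothetical, not taken from the paper):
      - "/a/b"       : the tag path /a/b occurs in the document (structure)
      - "/a/b:term"  : the term occurs in the text directly under /a/b (content)
    Tail text after child elements is ignored to keep the sketch short.
    """
    root = ET.fromstring(xml_text)
    items = set()

    def visit(node, prefix):
        path = f"{prefix}/{node.tag}"
        items.add(path)
        for term in (node.text or "").lower().split():
            items.add(f"{path}:{term}")
        for child in node:
            visit(child, path)

    visit(root, "")
    return frozenset(items)


def rule_matches(antecedent, transaction):
    """A rule 'antecedent -> class' fires when all its items are in the transaction."""
    return antecedent <= transaction


doc = "<paper><title>XML mining</title><body><sec>association rules</sec></body></paper>"
transaction = xml_to_transaction(doc)

# Hypothetical rule: these two features together suggest the class "data-mining".
rule = (frozenset({"/paper/title:xml", "/paper/body/sec"}), "data-mining")
if rule_matches(rule[0], transaction):
    print("predicted class:", rule[1])
```

An associative classifier in this style is a list of such rules, which is what makes the model readable: each prediction can be traced back to the structural and content features that triggered it.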

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Costa, G., Ortale, R., Ritacco, E. (2013). Learning Effective XML Classifiers Based on Discriminatory Structures and Nested Content. In: Fred, A., Dietz, J.L.G., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2011. Communications in Computer and Information Science, vol 348. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37186-8_10

  • DOI: https://doi.org/10.1007/978-3-642-37186-8_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-37185-1

  • Online ISBN: 978-3-642-37186-8

  • eBook Packages: Computer Science, Computer Science (R0)
