Extended VSM for XML Document Classification Using Frequent Subtrees

  • Conference paper
Focused Retrieval and Evaluation (INEX 2009)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6203)

Abstract

The structured link vector model (SLVM) is a representation for XML documents that extends the conventional vector space model (VSM) by incorporating document structure. In this paper, we describe our SLVM-based approach to XML document classification in the Document Mining Challenge of INEX 2009. Closed frequent subtrees serve as the structural units for extracting content from the XML documents, and the Chi-square test is used for feature selection.
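As an illustration of the feature-selection step, the following is a minimal sketch of Chi-square term scoring. It uses the standard 2x2 term/class contingency form common in text categorization. The abstract does not give the exact formulation used in the paper, so the function names (`chi_square`, `select_features`), the set-of-terms document representation, and the choice to keep each term's maximum score over classes are all assumptions made for this example.

```python
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class that lack the term
    n00: documents outside the class that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0  # degenerate table: term or class is constant
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_features(docs, labels, k):
    """Return the k terms with the highest chi-square score.

    docs:   list of sets of terms (hypothetical representation)
    labels: class label for each document, parallel to docs
    Each term is scored by its maximum chi-square over all classes
    (an assumption; averaging over classes is another common choice).
    """
    n = len(docs)
    class_size = Counter(labels)
    doc_freq = Counter(t for d in docs for t in d)
    in_class = Counter((t, c) for d, c in zip(docs, labels) for t in d)

    scores = {}
    for term, df in doc_freq.items():
        best = 0.0
        for c, size in class_size.items():
            n11 = in_class[(term, c)]
            n10 = df - n11
            n01 = size - n11
            n00 = n - df - n01
            best = max(best, chi_square(n11, n10, n01, n00))
        scores[term] = best

    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


docs = [{"xml", "tree"}, {"xml", "svm"}, {"text", "svm"}, {"text", "tree"}]
labels = ["a", "a", "b", "b"]
top_terms = select_features(docs, labels, 2)
```

In this toy corpus, "xml" and "text" each occur in only one class and so score highest, while "tree" and "svm" are spread evenly across both classes and score zero. In an SLVM setting the same scoring could be applied per structural unit rather than per whole document, though that detail is not stated in the abstract.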

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yang, J., Wang, S. (2010). Extended VSM for XML Document Classification Using Frequent Subtrees. In: Geva, S., Kamps, J., Trotman, A. (eds) Focused Retrieval and Evaluation. INEX 2009. Lecture Notes in Computer Science, vol 6203. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14556-8_44

  • DOI: https://doi.org/10.1007/978-3-642-14556-8_44

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-14555-1

  • Online ISBN: 978-3-642-14556-8

  • eBook Packages: Computer Science, Computer Science (R0)
