Extended VSM for XML Document Classification Using Frequent Subtrees

Yang, Jianwu; Wang, Songlin

doi:10.1007/978-3-642-14556-8_44

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6203)

Included in the following conference series:

International Workshop of the Initiative for the Evaluation of XML Retrieval

Abstract

The structured link vector model (SLVM) is a representation for modeling XML documents that extends the conventional vector space model (VSM) by incorporating document structure. In this paper, we describe our SLVM-based approach to classifying XML documents in the Document Mining Challenge of INEX 2009. Closed frequent subtrees serve as the structural units for extracting content from the XML documents, and the Chi-square test is used for feature selection.
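To make the feature selection step concrete, the following is a minimal sketch of Chi-square scoring for terms against a single class, using the standard 2x2 contingency form common in text categorization. It is an illustration under stated assumptions, not the authors' implementation: the function names `chi_square_scores` and `select_features`, the representation of each document as a set of terms, and the one-class-versus-rest setup are all hypothetical choices made for this example.

```python
from collections import Counter


def chi_square_scores(docs, labels, target):
    """Score each term against one class with the 2x2 chi-square statistic.

    docs:   list of term collections, one per document (presence/absence only)
    labels: list of class labels, parallel to docs
    target: the class the terms are scored against (one-vs-rest; an assumption)
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    n_neg = n - n_pos

    df_pos = Counter()  # target-class documents containing the term
    df_neg = Counter()  # other-class documents containing the term
    for terms, y in zip(docs, labels):
        (df_pos if y == target else df_neg).update(set(terms))

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]  # term present, class is target
        b = df_neg[term]  # term present, class is not target
        c = n_pos - a     # term absent,  class is target
        d = n_neg - b     # term absent,  class is not target
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        # chi2 = N * (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)); 0 if undefined
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def select_features(docs, labels, target, k):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    scores = chi_square_scores(docs, labels, target)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

In a structure-aware setting such as the one the abstract describes, the "terms" passed in could be drawn from the content covered by each structural unit rather than from the whole document; that mapping is not specified in the abstract and is left open here.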

Author information

Authors and Affiliations

Institute of Computer Sci. & Tech., Peking University, Beijing, 100871, China
Jianwu Yang & Songlin Wang

Editor information

Editors and Affiliations

Faculty of Science and Technology, Queensland University of Technology, GPO Box 2434, 4001, Brisbane, Qld, Australia
Shlomo Geva
Archives and Information Studies/Humanities, University of Amsterdam, Turfdraagsterpad 9, 1012 XT, Amsterdam, The Netherlands
Jaap Kamps
Department of Computer Science, University of Otago, P.O. Box 56, 9054, Dunedin, New Zealand
Andrew Trotman

About this paper

Cite this paper

Yang, J., Wang, S. (2010). Extended VSM for XML Document Classification Using Frequent Subtrees. In: Geva, S., Kamps, J., Trotman, A. (eds) Focused Retrieval and Evaluation. INEX 2009. Lecture Notes in Computer Science, vol 6203. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14556-8_44

DOI: https://doi.org/10.1007/978-3-642-14556-8_44
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14555-1
Online ISBN: 978-3-642-14556-8
eBook Packages: Computer Science, Computer Science (R0)
