Sequential Pattern Mining for Structure-Based XML Document Classification

Garboni, Calin; Masseglia, Florent; Trousse, Brigitte

doi:10.1007/978-3-540-34963-1_35


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3977)

Included in the following conference series:

International Workshop of the Initiative for the Evaluation of XML Retrieval


Abstract

This article presents an original supervised classification technique for XML documents that relies on structure only. Each XML document is viewed as an ordered labeled tree, represented by its tags only. The method has three steps: a cleaning step; a characterization step, in which each predefined cluster is described by its frequent structural subsequences; and a classification step, in which XML documents are assigned to clusters based on the mined patterns of each cluster.
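
The three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the cleaning rule, the mining algorithm, or the scoring function, so everything below is an assumption made for illustration. Here cleaning drops tags that occur in every training document, mining enumerates short order-preserving tag subsequences by brute force, and a document goes to the cluster whose patterns it covers best. The names `tag_sequence`, `subsequences`, `mine_patterns`, `train`, and `classify`, and the parameters `min_support` and `max_len`, are all hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def tag_sequence(xml_text):
    """Flatten an XML document into its tags, in document (preorder) order."""
    return [elem.tag for elem in ET.fromstring(xml_text).iter()]


def subsequences(seq, max_len):
    """Distinct order-preserving (not necessarily contiguous) subsequences
    of length 1..max_len. Brute force, so only suitable for short sequences."""
    found = set()
    for n in range(1, max_len + 1):
        for idx in combinations(range(len(seq)), n):
            found.add(tuple(seq[i] for i in idx))
    return found


def mine_patterns(sequences, min_support, max_len):
    """Subsequences contained in at least a min_support fraction of sequences."""
    counts = {}
    for seq in sequences:
        for pat in subsequences(seq, max_len):
            counts[pat] = counts.get(pat, 0) + 1
    threshold = min_support * len(sequences)
    return {pat for pat, c in counts.items() if c >= threshold}


def train(labeled_docs, min_support=0.8, max_len=3):
    """labeled_docs: iterable of (cluster_label, xml_text) pairs."""
    by_class = {}
    for label, xml_text in labeled_docs:
        by_class.setdefault(label, []).append(tag_sequence(xml_text))
    # Cleaning (assumed rule): tags present in every training document
    # cannot discriminate between clusters, so they are dropped.
    all_seqs = [s for seqs in by_class.values() for s in seqs]
    common = set.intersection(*(set(s) for s in all_seqs))
    model = {}
    for label, seqs in by_class.items():
        cleaned = [[t for t in s if t not in common] for s in seqs]
        model[label] = mine_patterns(cleaned, min_support, max_len)
    return model, common


def classify(model, common, xml_text, max_len=3):
    """Return the cluster whose mined patterns are best covered by the document."""
    seq = [t for t in tag_sequence(xml_text) if t not in common]
    present = subsequences(seq, max_len)

    def score(label):
        pats = model[label]
        return len(pats & present) / len(pats) if pats else 0.0

    # Sorting the labels makes tie-breaking deterministic.
    return max(sorted(model), key=score)
```

Calling `train` on a list of `(label, xml_text)` pairs returns the per-cluster pattern sets together with the discarded tags, and `classify` reuses both on an unseen document. A real system would replace the brute-force enumeration with a dedicated sequential pattern mining algorithm, since the number of candidate subsequences grows quickly with document size.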



Author information

Authors and Affiliations

West University of Timisoara, Romania
Calin Garboni
INRIA Sophia Antipolis, AxIS Research Team, 2004, route des Lucioles, BP93, F-06902 Cedex, Sophia Antipolis, France
Calin Garboni, Florent Masseglia & Brigitte Trousse


Editor information

Editors and Affiliations

University of Duisburg-Essen, Duisburg, Germany
Norbert Fuhr
Queen Mary, University of London, London, UK
Mounia Lalmas
University of Duisburg-Essen, Germany
Saadia Malik
Microsoft Research Cambridge, United Kingdom
Gabriella Kazai


About this paper

Cite this paper

Garboni, C., Masseglia, F., Trousse, B. (2006). Sequential Pattern Mining for Structure-Based XML Document Classification. In: Fuhr, N., Lalmas, M., Malik, S., Kazai, G. (eds) Advances in XML Information Retrieval and Evaluation. INEX 2005. Lecture Notes in Computer Science, vol 3977. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-34963-1_35


DOI: https://doi.org/10.1007/978-3-540-34963-1_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34962-4
Online ISBN: 978-3-540-34963-1
eBook Packages: Computer Science, Computer Science (R0)
