Sequential Pattern Mining for Structure-Based XML Document Classification

  • Conference paper
Advances in XML Information Retrieval and Evaluation (INEX 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3977)

Abstract

This article presents an original supervised classification technique for XML documents that is based on structure only. Each XML document is viewed as an ordered labeled tree, represented by its tags only. Our method has three steps: after a cleaning step, we characterize each predefined cluster in terms of frequent structural subsequences; we then classify the XML documents based on the mined patterns of each cluster.
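
The abstract does not give algorithmic details, so the following Python fragment is only a minimal sketch of the general idea, not the authors' implementation. It assumes that a document is flattened into its tags in document order, that a "frequent structural subsequence" is a gapped subsequence of tags found in at least a given fraction of a cluster's documents, and that a document is assigned to the cluster whose patterns it matches in the largest proportion. The cleaning step is omitted because the abstract does not describe it. All function names, the brute-force miner, the scoring rule, and the toy documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def tag_sequence(xml_text):
    """Flatten an XML document into its tags, in document (preorder) order."""
    root = ET.fromstring(xml_text)
    return [elem.tag for elem in root.iter()]


def is_subsequence(pattern, seq):
    """True if the tags of `pattern` occur in `seq` in order (gaps allowed)."""
    it = iter(seq)
    return all(tag in it for tag in pattern)


def frequent_subsequences(sequences, min_support, max_len=3):
    """Brute-force miner (assumption: fine for toy data only).

    Returns every subsequence of length <= max_len that is contained in
    at least `min_support` (a fraction) of the given sequences.
    """
    candidates = set()
    for seq in sequences:
        for n in range(1, max_len + 1):
            candidates.update(combinations(seq, n))
    threshold = min_support * len(sequences)
    return {
        p for p in candidates
        if sum(is_subsequence(p, s) for s in sequences) >= threshold
    }


def classify(seq, patterns_by_class):
    """Assign `seq` to the class whose patterns it matches in the largest share."""
    def score(patterns):
        if not patterns:
            return 0.0
        return sum(is_subsequence(p, seq) for p in patterns) / len(patterns)

    return max(patterns_by_class, key=lambda c: score(patterns_by_class[c]))


# Hypothetical training clusters (made-up documents, tags only).
training = {
    "movie": [
        "<movie><title/><director/><cast><actor/><actor/></cast></movie>",
        "<movie><title/><cast><actor/></cast><year/></movie>",
    ],
    "article": [
        "<article><title/><author/><section><p/></section></article>",
        "<article><title/><section><p/><p/></section></article>",
    ],
}

# Characterize each cluster by its frequent tag subsequences.
patterns = {
    label: frequent_subsequences([tag_sequence(d) for d in docs], min_support=1.0)
    for label, docs in training.items()
}

# Classify an unseen document from its structure alone.
unseen = "<movie><title/><cast><actor/></cast></movie>"
print(classify(tag_sequence(unseen), patterns))  # prints: movie
```

The exhaustive candidate generation grows combinatorially with document length; a dedicated sequential pattern mining algorithm would be the natural replacement for `frequent_subsequences` on real collections.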

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Garboni, C., Masseglia, F., Trousse, B. (2006). Sequential Pattern Mining for Structure-Based XML Document Classification. In: Fuhr, N., Lalmas, M., Malik, S., Kazai, G. (eds) Advances in XML Information Retrieval and Evaluation. INEX 2005. Lecture Notes in Computer Science, vol 3977. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-34963-1_35

  • DOI: https://doi.org/10.1007/978-3-540-34963-1_35

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-34962-4

  • Online ISBN: 978-3-540-34963-1

  • eBook Packages: Computer Science, Computer Science (R0)
