
A New Sequential Mining Approach to XML Document Clustering*

  • Conference paper
Web Technologies Research and Development - APWeb 2005 (APWeb 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3399)


Abstract

XML has recently become a popular format for representing semi-structured data and a standard for data exchange over the web, owing to its applicability across a wide range of applications. XML documents therefore form an important data mining domain. In this paper, we propose a new XML document clustering technique based on a sequential pattern mining algorithm. Our approach first extracts the representative structures of frequent patterns from schemaless XML documents by using a sequential pattern mining algorithm. Then, unlike most previous document clustering methods, we apply a clustering algorithm for transactional data that requires no measure of pairwise similarity, treating each XML document as a transaction and the frequent structures extracted from the documents as the items of that transaction. We evaluated our clustering algorithm by comparing it with previous methods. The experimental results show that the proposed method is effective both in performance and in producing clusters with higher cohesion.
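The two-stage idea in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: documents are reduced to root-to-leaf tag paths, the miner is a brute-force subsequence counter rather than an efficient sequential pattern mining algorithm, and the clustering step is a single greedy pass using a CLOPE-style criterion (total item occurrences times cluster size, divided by the number of distinct items raised to a repulsion parameter `r`), which needs no pairwise similarity. The function names, `min_support`, `max_len`, and the sample documents are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations


def tag_paths(xml_text):
    """Return the set of root-to-leaf tag paths of one XML document."""
    paths = set()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        children = list(node)
        if not children:
            paths.add(path)
        for child in children:
            walk(child, path)

    walk(ET.fromstring(xml_text), ())
    return paths


def subsequences(path, max_len):
    """All order-preserving subsequences of a path with length 2..max_len."""
    found = set()
    for k in range(2, min(max_len, len(path)) + 1):
        for idx in combinations(range(len(path)), k):
            found.add(tuple(path[i] for i in idx))
    return found


def mine_frequent_structures(docs, min_support, max_len=3):
    """Naive miner: keep tag subsequences present in >= min_support documents."""
    per_doc = []
    for text in docs:
        patterns = set()
        for path in tag_paths(text):
            patterns |= subsequences(path, max_len)
        per_doc.append(patterns)
    support = Counter(p for patterns in per_doc for p in patterns)
    frequent = {p for p, n in support.items() if n >= min_support}
    # Each document becomes a transaction whose items are its frequent structures.
    return [patterns & frequent for patterns in per_doc]


def delta_profit(counts, n, transaction, r):
    """Change in S * N / W**r if the transaction joined this cluster."""
    s_old, w_old = sum(counts.values()), len(counts)
    s_new = s_old + len(transaction)
    w_new = w_old + sum(1 for item in transaction if item not in counts)
    if w_new == 0:
        return 0.0
    old = (s_old * n / w_old ** r) if w_old else 0.0
    return s_new * (n + 1) / w_new ** r - old


def cluster_transactions(transactions, r=2.0):
    """Single greedy pass: each transaction goes where the gain is largest."""
    clusters = []  # each cluster is [Counter of item occurrences, size]
    labels = []
    for t in transactions:
        candidates = clusters + [[Counter(), 0]]  # last one is a fresh cluster
        best = max(range(len(candidates)),
                   key=lambda i: delta_profit(candidates[i][0], candidates[i][1], t, r))
        if best == len(clusters):
            clusters.append(candidates[best])
        clusters[best][0].update(t)
        clusters[best][1] += 1
        labels.append(best)
    return labels


docs = [
    "<book><title/><author><name/></author></book>",
    "<book><title/><author><name/></author><year/></book>",
    "<order><item><price/></item><customer/></order>",
    "<order><item><price/><qty/></item><customer/></order>",
]
transactions = mine_frequent_structures(docs, min_support=2)
labels = cluster_transactions(transactions)
print(labels)  # [0, 0, 1, 1]
```

In this toy run the two book-like documents share five frequent structures and the two order-like documents share five others, so the greedy pass places them in separate clusters. A fuller treatment would iterate over the transactions repeatedly until no assignment changes and would replace the brute-force miner with a pattern-growth algorithm.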




Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hwang, J.H., Ryu, K.H. (2005). A New Sequential Mining Approach to XML Document Clustering*. In: Zhang, Y., Tanaka, K., Yu, J.X., Wang, S., Li, M. (eds) Web Technologies Research and Development - APWeb 2005. APWeb 2005. Lecture Notes in Computer Science, vol 3399. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31849-1_27


  • DOI: https://doi.org/10.1007/978-3-540-31849-1_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25207-8

  • Online ISBN: 978-3-540-31849-1

  • eBook Packages: Computer Science, Computer Science (R0)
