XML document clustering: techniques and challenges

Artificial Intelligence Review

Abstract

The increasing availability of heterogeneous XML sources has raised a number of issues concerning how to represent and manage semi-structured data. In recent years, because of the importance of managing these resources and extracting knowledge from them, many methods have been proposed to represent and cluster them in different ways. Different similarity measures have been extended, and in some contexts semantic issues have also been taken into account. In this paper, we review XML clustering methods, considering different representation methods, such as tree-based and vector-based, and the similarity measures used with them. We also propose a taxonomy for these methods.

Author information

Corresponding author

Correspondence to Elaheh Asghari.

About this article

Cite this article

Asghari, E., KeyvanPour, M. XML document clustering: techniques and challenges. Artif Intell Rev 43, 417–436 (2015). https://doi.org/10.1007/s10462-012-9379-2

  • DOI: https://doi.org/10.1007/s10462-012-9379-2
