Abstract
The increasing availability of heterogeneous XML sources has raised a number of issues concerning how to represent and manage such semi-structured data. In recent years, because of the importance of managing these resources and extracting knowledge from them, many methods have been proposed to represent and cluster them in different ways. Various similarity measures have been extended, and in some contexts semantic issues have also been taken into account. In this paper, we review XML clustering methods, considering different representation methods, such as tree-based and vector-based ones, and the similarity measures used with them. We also propose a taxonomy of these methods.
Cite this article
Asghari, E., KeyvanPour, M. XML document clustering: techniques and challenges. Artif Intell Rev 43, 417–436 (2015). https://doi.org/10.1007/s10462-012-9379-2
DOI: https://doi.org/10.1007/s10462-012-9379-2