XML clustering: a review of structural approaches

Maciej Piernik; Dariusz Brzezinski; Tadeusz Morzy; Anna Lesniewska

doi:10.1017/S0269888914000216

XML clustering: a review of structural approaches

Published online by Cambridge University Press: 29 October 2014

Tadeusz Morzy and

Maciej Piernik: Affiliation:
Institute of Computing Science, Poznan University of Technology, 60-965 Poznan, Poland; e-mail: mpiernik@cs.put.poznan.pl, dbrzezinski@cs.put.poznan.pl, tmorzy@cs.put.poznan.pl, alesniewska@cs.put.poznan.pl
Dariusz Brzezinski: Affiliation:
Institute of Computing Science, Poznan University of Technology, 60-965 Poznan, Poland; e-mail: mpiernik@cs.put.poznan.pl, dbrzezinski@cs.put.poznan.pl, tmorzy@cs.put.poznan.pl, alesniewska@cs.put.poznan.pl
Tadeusz Morzy: Affiliation:
Institute of Computing Science, Poznan University of Technology, 60-965 Poznan, Poland; e-mail: mpiernik@cs.put.poznan.pl, dbrzezinski@cs.put.poznan.pl, tmorzy@cs.put.poznan.pl, alesniewska@cs.put.poznan.pl
Anna Lesniewska: Affiliation:
Institute of Computing Science, Poznan University of Technology, 60-965 Poznan, Poland; e-mail: mpiernik@cs.put.poznan.pl, dbrzezinski@cs.put.poznan.pl, tmorzy@cs.put.poznan.pl, alesniewska@cs.put.poznan.pl

Article contents

Abstract
References

Get access

Rights & Permissions

Abstract

With its presence in data integration, chemistry, biological, and geographic systems, eXtensible Markup Language (XML) has become an important standard not only in computer science. A common problem among the mentioned applications involves structural clustering of XML documents—an issue that has been thoroughly studied and led to the creation of a myriad of approaches. In this paper, we present a comprehensive review of structural XML clustering. First, we provide a basic introduction to the problem and highlight the main challenges in this research area. Subsequently, we divide the problem into three subtasks and discuss the most common document representations, structural similarity measures, and clustering algorithms. In addition, we present the most popular evaluation measures, which can be used to estimate clustering quality. Finally, we analyze and compare 23 state-of-the-art approaches and arrange them in an original taxonomy. By providing an up-to-date analysis of existing structural XML clustering algorithms, we hope to showcase methods suitable for current applications and draw lines of future research.

Type: Articles
Information: The Knowledge Engineering Review , Volume 30 , Issue 3 , May 2015 , pp. 297 - 323

DOI: https://doi.org/10.1017/S0269888914000216 [Opens in a new window]
Copyright: © Cambridge University Press 2014

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

Article purchase

Temporarily unavailable

References

Achard, F., Vaysseix, G. & Barillot, E. 2001. XML, bioinformatics and data integration. Bioinformatics 17(1), 115–125.CrossRef Google Scholar PubMed

Aggarwal, C. C., Ta, N., Wang, J., Feng, J. & Zaki, M. 2007. XProj: a framework for projected structural clustering of XML documents. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’07, 46–55.Google Scholar

Aggarwal, C. C. & Zhai, C. (eds) 2012. Mining Text Data, Springer.CrossRef Google Scholar

Aïtelhadj, A., Boughanem, M., Mezghiche, M. & Souam, F. 2012. Using structural similarity for clustering XML documents. Knowledge and Information Systems 32(1), 109–139.CrossRef Google Scholar

Algergawy, A., Mesiti, M., Nayak, R. & Saake, G. 2011. XML data clustering: an overview. ACM Computing Surveys 43(4), 25:1–25:41.CrossRef Google Scholar

Alishahi, M., Naghibzadeh, M. & Aski, B. S. 2010. Tag name structure-based clustering of XML documents. International Journal of Electrical and Computer Engineering 2(1), 119–126.CrossRef Google Scholar

Andreopoulos, B., An, A., Wang, X. & Schroeder, M. 2009. A roadmap of clustering algorithms: finding a match for a biomedical application. Briefings in Bioinformatics 10(3), 297–314.CrossRef Google Scholar PubMed

Ankerst, M., Breunig, M. M., Kriegel, H.-P. & Sander, J. 1999. OPTICS: ordering points to identify the clustering structure. SIGMOD Record 28(2), 49–60.CrossRef Google Scholar

Antonellis, P., Makris, C. & Tsirakis, N. 2008. XEdge: clustering homogeneous and heterogeneous XML documents using edge summaries. In Proceedings of the 2008 International Conference on Advances in Computer, SAC’08, 1081–1088.Google Scholar

Ausbrooks, R., Buswell, S., Carlisle, D., Chavchanidze, G., Dalmas, S., Devitt, S., Diaz, A., Dooley, S., Hunter, R., Ion, P., Kohlhase, M., Lazrek, A., Libbrecht, P., Miller, B., MinerR., (deceased) R., (deceased), Rowley, C., Sargent, M., Smith, B., Soiffer, N., Sutor, R., & Watt, S. 2014. Mathematical Markup Language (MathML), Version 3.0, 2nd Edition. In Carlisle, D., Ion, P. & Miner, R. (eds). W3C, http://www.w3.org/TR/MathML3/.Google Scholar

Baeza-Yates, R. A. & Ribeiro-Neto, B. A. 1999. Modern Information Retrieval, ACM Press/Addison-Wesley.Google Scholar

Bennett, C. H., Gacs, P., Li, M., Vitanyi, P. M. B. & Zurek, W. H. 1998. Information distance. IEEE Transactions on Information Theory 44(4), 1407–1423.CrossRef Google Scholar

Bertino, E., Guerrini, G. & Mesiti, M. 2008. Measuring the structural similarity among XML documents and DTDs. Journal of Intelligent Information Systems 30(1), 55–92.CrossRef Google Scholar

Brzezinski, D., Lesniewska, A., Morzy, T. & Piernik, M. 2011. XCleaner: a new method for clustering XML documents by structure. Control and Cybernetics 40(3), 877–891.Google Scholar

Buttler, D. 2004. A short survey of document structure similarity algorithms. In Proceedings of the 5th International Conference on Internet Computing, 3–9.Google Scholar

Candillier, L., Tellier, I. & Torre, F. 2006. Transforming XML trees for efficient classification and clustering. In Proceedings of the 4th International Conference of the Initiative for the Evaluation of XML Retrieval, INEX’05, 469–480.Google Scholar

Chawathe, S. S. 1999. Comparing hierarchical data in external memory. In Proceedings of the 25th International Conference on Very Large Data Bases, VLDB’99, 90–101.Google Scholar

Chawathe, S. S., Rajaraman, A., Garcia-Molina, H. & Widom, J. 1996. Change detection in hierarchically structured information. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD ’96, 493–504.Google Scholar

Costa, G., Manco, G., Ortale, R. & Tagarelli, A. 2004. A tree-based approach to clustering XML documents by structure. In Proccedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD’04, 137–148.Google Scholar

Cover, T. M. & Thomas, J. A. 1991. Elements of Information Theory, Wiley-Interscience.Google Scholar

Crescenzi, V., Mecca, G. & Merialdo, P. 2001. RoadRunner: towards automatic data extraction from large web sites. In Proceedings of the 27th International Conference on Very Large Data Bases, VLDB’01, 109–118.Google Scholar

Crescenzi, V., Merialdo, P. & Missier, P. 2005. Clustering web pages based on their structure. Data & Knowledge Engineering 54(3), 279–299.CrossRef Google Scholar

Dalamagas, T., Cheng, T., Winkel, K.-J. & Sellis, T. 2006. A methodology for clustering XML documents by structure. Information Systems 31(3), 187–228.CrossRef Google Scholar

Dalamagas, T., Cheng, T., Winkel, K.-J. & Sellis, T. K. 2004. Clustering XML documents by structure. In Proceedings of the 3rd Hellenic Conference on Artificial Intelligence, SETN’04, 112–121.Google Scholar

Dempster, A. P., Laird, N. M. & Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39(1), 1–38.CrossRef Google Scholar

Dolin, R. H., Alschuler, L., Boyer, S., Beebe, C., Behlen, F. M., Biron, P. V. & Shvo, A. S. 2006. HL7 clinical document architecture, release 2. Journal of the American Medical Informatics Association 13(1), 30–39.CrossRef Google Scholar PubMed

Doucet, A. & Lehtonen, M. 2006. Unsupervised classification of text-centric XML document collections. In Proceedings of the 5th International Conference of the Initiative for the Evaluation of XML Retrieval, INEX’06, 497–509.Google Scholar

Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, KDD’96, 226–231.Google Scholar

Flesca, S., Manco, G., Masciari, E., Pontieri, L. & Pugliese, A. 2002. Detecting structural similarities between XML documents. In Proceedings of the 5th International Workshop on the Web and Databases, WebDB’02, 55–60.Google Scholar

Flesca, S., Manco, G., Masciari, E., Pontieri, L. & Pugliese, A. 2005. Fast detection of XML structural similarity. IEEE Transactions on Knowledge and Data Engineering 17(2), 160–175.CrossRef Google Scholar

Goldman, R. & Widom, J. 1997. DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. Technical report no. 1997-50, Stanford InfoLab.Google Scholar

Guerrini, G., Mesiti, M. & Sanz, I. 2007. An overview of similarity measures for clustering XML documents. In Web Data Management Practices: Emerging Techniques and Technologies, chapter 3, Vakali, A. & Pallis, G. (eds)., 56--78. Idea Group Inc. (IGI).CrossRef Google Scholar

Guha, S., Rastogi, R. & Shim, K. 1998. CURE: an efficient clustering algorithm for large databases. SIGMOD Record 27(2), 73–84.CrossRef Google Scholar

Guha, S., Rastogi, R. & Shim, K. 2000. ROCK: a robust clustering algorithm for categorical attributes. Information Systems 25(5), 345–366.CrossRef Google Scholar

Hadzic, F., Hecker, M. & Tagarelli, A. 2011. XML document clustering using structure preserving flat representation of XML content and structure. In Proceedings of the 7th International Conference on Advanced Data Mining and Applications, ADMA’11, 403–416.Google Scholar

Hagenbuchner, M., Sperduti, A., Tsoi, A. C., Trentini, F., Scarselli, F. & Gori, M. 2006. Clustering XML documents using self-organizing maps for structures. In Proceedings of the 4th International Conference of the Initiative for the Evaluation of XML Retrieval, INEX’05, 481–496.Google Scholar

Han, J. 2005. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers Inc.Google Scholar

Hartigan, J. A. & Wong, M. A. 1979. A K-means clustering algorithm. Journal of the Royal Statistical Society Series C Applied Statistics 28, 100–108.Google Scholar

Helmer, S. 2007. Measuring the structural similarity of semistructured documents using entropy. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB’07, VLDB Endowment, 1022–1032.Google Scholar

Helmer, S., Augsten, N. & Böhlen, M. 2012. Measuring structural similarity of semistructured data based on information-theoretic approaches. VLDB Journal 21(5), 677–702.CrossRef Google Scholar

Hubert, L. J. & Levin, J. R. 1976. A general statistical framework for accessing categorical. Psychological Bulletin 83, 1072–1082.CrossRef Google Scholar

Husek, D., Pokorny, J., Rezankova, H. & Snasel, V. 2007. Data clustering: from documents to the web. In Web Data Management Practices: Emerging Techniques and Technologies, chapter 1, Vakali, A. & Pallis, G. (eds)., 1--33. Idea Group Inc. (IGI).CrossRef Google Scholar

Hwang, J. H. & Ryu, K. H. 2004. A new XML clustering for structural retrieval. In Proceedings of the 23rd International Conference on Conceptual Modeling, ER’04, 377–387.Google Scholar

Hwang, J. H. & Ryu, K. H. 2010. A weighted common structure based clustering technique for XML documents. Journal of Systems and Software 83(7), 1267–1274.CrossRef Google Scholar

Jain, A. K., Murty, M. N. & Flynn, P. J. 1999. Data clustering: a review. ACM Computing Surveys 31(3), 264–323.CrossRef Google Scholar

Johnson, S. 1967. Hierarchical clustering schemes. Psychometrika 32, 241–254.CrossRef Google Scholar PubMed

Kohonen, T. 1989. Self-Organization and Associative Memory, 3rd edition. Springer-Verlag New York Inc.CrossRef Google Scholar

Kutty, S., Nayak, R. & Li, Y. 2010. Utilising semantic tags in XML clustering. In Proceedings of the 8th International Conference on Initiative for the Evaluation of XML Retrieval, INEX’09, 416–425.Google Scholar

Kutty, S., Nayak, R. & Li, Y. 2011. XML documents clustering using a tensor space model. In Proceedings of the 15th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD’11, 488–499.Google Scholar

Kutty, S., Tran, T., Nayak, R. & Li, Y. 2007. Clustering XML documents using closed frequent subtrees: a structural similarity approach. In Proceedings of the 5th International Conference on Initiative for the Evaluation of XML Retrieval, INEX’06, 183–194.Google Scholar

Kutty, S., Tran, T., Nayak, R. & Li, Y. 2008. Clustering XML documents using frequent subtrees. In Proceedings of the 6th International Conference on Initiative for the Evaluation of XML Retrieval, INEX’07, 436–445.Google Scholar

Lee, M. L., Yang, L. H., Hsu, W. & Yang, X. 2002. XClust: clustering XML schemas for effective integration. In Proceedings of the 11th International Conference on Information and Knowledge Management, CIKM’02, 292–299.Google Scholar

Lesniewska, A. 2009. Clustering XML documents by structure. In 13th East-European Conference Advances on Databases and Information Systems’, ADBIS’09, 238–246.Google Scholar

Lesniewska, A. & Primke, K. 2008. Finding Features on the Basis of the Structure of XML Documents. Technical Report RA 16/08, Poznan University of Technology.Google Scholar

Leung, H.-P., Chung, K. F.-L., fai Chan, S. C. & Luk, R. W. R. 2005. XML document clustering using common XPath. In International Workshop on Challenges in Web Information Retrieval and Integration, WIRI’05, 91–96.Google Scholar

Li, J., Liu, J., Liu, C., Wang, G., Yu, J. X. & Yangt, C. 2007. Computing structural similarity of source XML schemas against domain XML schema. In Proceedings of the 19th Australasian Database Conference, ADC’08, 155–164.Google Scholar

Lian, W., Cheung, D. W.-l., Mamoulis, N. & Yiu, S.-M. 2004. An efficient and scalable algorithm for clustering XML documents by structure. IEEE Transactions on Information Theory 16(1), 82–96.Google Scholar

Moon, T. K. 1996. The expectation-maximization algorithm. IEEE Signal Processing Magazine 13(6), 47–60.CrossRef Google Scholar

Murray-Rust, P. & Rzepa, H. 1995. Chemical Markup Language, http://www.xml-cml.org/.Google Scholar

Nayak, R. 2008. Fast and effective clustering of XML data using structural information. Knowledge and Information Systems 14(2), 197–215.CrossRef Google Scholar

Nayak, R. & Iryadi, W. 2006. XMine: a methodology for mining XML structure. In Proceedings of the 8th Asia-Pacific Web Conference on Frontiers of WWW Research and Development, APWeb’06, 786–792.Google Scholar

Nierman, A. & Jagadish, H. V. 2002. Evaluating structural similarity in XML documents. In Proceedings of the 5th International Workshop on the Web and Databases, WebDB’02, 61–66.Google Scholar

Pawlik, M. & Augsten, N. 2011. RTED: a robust algorithm for the tree edit distance. The Proceedings of the VLDB Endowment 5(4), 334–345.Google Scholar

Piernik, M., Brzezinski, D. & Morzy, T. 2014. Clustering XML documents by patterns. Knowledge and Information Systems, Technical report number: RA-12/2013. Poznan University of Technology. (review in progress).Google Scholar

Pospech, T. 2009. GML – Geography Markup Language, GRIN Verlag.Google Scholar

Rafiei, D., Moise, D. L. & Sun, D. 2006. Finding syntactic similarities between XML documents. In Proceedings of the 17th International Conference on Database and Expert Systems Applications, DEXA’06, 512–516.Google Scholar

Rand, W. M. 1971. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66(336), 846–850.CrossRef Google Scholar

Rousseeuw, P. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20(1), 53–65.CrossRef Google Scholar

Sankoff, D. & Kruskal, J. B. 2000. Time Warps, String Edits, and Macromolecules, Cambridge University Press.Google Scholar

Selkow, S. 1977. The tree-to-tree editing problem. Information Processing Letters 6(6), 184–186.CrossRef Google Scholar

Sokal, R. R. & Rohlf, F. J. 1962. The comparison of dendrograms by objective methods. Taxon 11(2), 33–40.CrossRef Google Scholar

Somervuo, P. & Kohonen, T. 2000. Clustering and visualization of large protein sequence databases by means of an extension on the self-organizing map. In Proceedings of the 3rd International Conference on Discovery Science, DS’00, 76–85.Google Scholar

Tagarelli, A. & Greco, S. 2010. Semantic clustering of XML documents. ACM Transactions on Information Systems 28(1), 3:1–3:56.CrossRef Google Scholar

Tai, K.-C. 1979. The tree-to-tree correction problem. The Journal of the ACM 26(3), 422–433.CrossRef Google Scholar

Tan, P.-N., Steinbach, M. & Kumar, V. 2005. Introduction to Data Mining, Addison-Wesley.Google Scholar

Tekli, J. & Chbeir, R. 2012. A novel XML document structure comparison framework based-on sub-tree commonalities and label semantics. The Journal of Web Semantics 11, 14–40.CrossRef Google Scholar

Tekli, J., Chbeir, R. & Yetongnon, K. 2009. Survey: an overview on XML similarity: background, current trends and future directions. Computer Science Review 3(3), 151–173.CrossRef Google Scholar

Theobald, M. 2003. Exploiting structure, annotation, and ontological knowledge for automatic classification of XML data. In Proceedings of the International Workshop on the Web and Databases, WebDB’03, 1–6.Google Scholar

Tran, T., Nayak, R. & Bruza, P. 2007. Document clustering using incremental and pairwise approaches. In Proceedings of the 5th International Conference on Initiative for the Evaluation of XML Retrieval, INEX’06, 4862, 222–233.Google Scholar

Tran, T., Nayak, R. & Bruza, P. 2008. Combining structure and content similarities for XML document clustering. In Proceedings of the 7th Australasian Data Mining Conference, AusDM’08, 219–226.Google Scholar

Vakali, A., Pallis, G. & Angelis, L. 2007. Clustering web information services. In Web Data Management Practices: Emerging Techniques and Technologies, chapter 2, Vakali, A. & Pallis, G. (eds)., 34--55. Idea Group Inc. (IGI).CrossRef Google Scholar

Vercoustre, A.-M., Fegas, M., Gul, S. & Lechevallier, Y. 2006. A flexible structured-based representation for XML document mining. In Proceedings of the 4th International Conference on Initiative for the Evaluation of XML Retrieval, INEX’05, 443–457.Google Scholar

Viyanon, W., Madria, S. K. & Bhowmick, S. S. 2008. XML data integration based on content and structure similarity using keys. In Proceedings of the OTM 2008 Confederated International Conferences, CoopIS, DOA, GADA, IS, and ODBASE, OTM’08, 484–493.Google Scholar

Wang, T., Liu, D.-X., Lin, X.-Z., Sun, W. & Ahmad, G. 2006. Clustering large scale of XML documents. In Proceedings of the 1st International Conference on Advances in Grid and Pervasive Computing, GPC’06, 447–455.Google Scholar

Wernecke, J. 2008. The KML Handbook: Geographic Visualization for the Web, 1st edition. Addison-Wesley Professional.Google Scholar

Westbrook, J., Ito, N., Nakamura, H., Henrick, K. & Berman, H. M. 2005. PDBML: the representation of archival macromolecular structure data in XML. Bioinformatics 21(7), 988–992.CrossRef Google Scholar PubMed

Wilson, R., Cobb, M., McCreedy, F., Ladner, R., Olivier, D., Lovitt, T., Shaw, K., Petry, F. & Abdelguerfi, M. 2003. Geographical data interchange using XML-enabled technology within the GIDB system. In XML Data Management: Native XML and XML-Enabled Database Systems, Chaudhri, A. B., Rashid, A. & Zicari, R. (eds). Addison Wesley, 354–374.Google Scholar

Xu, R. & Wunsch, I. 2005. Survey of clustering algorithms. IEEE Transactions on Neural Networks 16(3), 645–678.CrossRef Google Scholar PubMed

Yang, R., Kalnis, P. & Tung, A. K. H. 2005. Similarity evaluation on tree-structured data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD’05, 754–765.Google Scholar

Yang, Y., Guan, X. & You, J. 2002. CLOPE: a fast and effective clustering algorithm for transactional data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 682–687.Google Scholar

Yoon, J. P., Raghavan, V., Chakilam, V. & Kerschberg, L. 2001. BitCube: a three-dimensional bitmap indexing for XML documents. Journal of Intelligent Information Systems 17(2–3), 241–254.CrossRef Google Scholar

Zaki, M. J. 2002. Efficiently mining frequent trees in a forest. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 71–80.Google Scholar

Zaki, M. J. & Aggarwal, C. C. 2006. XRules: an effective algorithm for structural classification of XML data. Machine Learning 62(1–2), 137–170.CrossRef Google Scholar

Zhang, K. & Shasha, D. 1989. Simple fast algorithms for the editing distance between trees and related problems. The SIAM Journal on Computing 18(6), 1245–1262.CrossRef Google Scholar

Zhao, Y. & Karypis, G. 2002. Evaluation of hierarchical clustering algorithms for document datasets. In Proceedings of the 11th ACM International Conference on Information and Knowledge Management, CIKM’02, 515–524.Google Scholar

Zheng, Y.-T., Li, Y., Zha, Z.-J. & Chua, T.-S. 2011. Mining travel patterns from GPS-tagged photos. In Proceedings of the 17th International Multimedia Modeling Conference, MMM’11, 262–272.Google Scholar

Zhu, Y.-W., Ji, G.-L. & Sun, Q.-H. 2010. Clustering GML documents using maximal frequent induced subtrees. In Proceedings of the 7th International Conference on Fuzzy Systems and Knowledge Discover, FSKD’10, 5, 2265–2269.Google Scholar

Article contents

XML clustering: a review of structural approaches

Abstract

Access options

Article purchase

Temporarily unavailable

References

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests