Abstract
We consider a clustering problem in which the data objects are rooted m-ary trees with known node correspondence. We assume that the nodes of the trees are unweighted, but the edges can be unweighted or weighted. We measure the similarity and the distance between two trees using vertex/edge overlap (VEO) and graph edit distance (GED), respectively. For both measures, we first study the problem of finding a centroid tree of a given cluster of trees in the unweighted and weighted edge cases. We compute the optimal centroid tree of a given cluster for all measures except the weighted VEO, for which a heuristic is developed. We then propose k-means-based algorithms that repeat cluster assignment and centroid update steps until convergence. The initial centroid trees are constructed based on the properties of the data. The assignment steps use unweighted or weighted versions of VEO or GED to assign each tree to the most similar centroid tree. In the update steps, each centroid tree is updated by considering the trees assigned to it. The proposed algorithms are compared with the traditional k-modes and k-means algorithms on randomly generated datasets and are shown to be more effective and more robust to outliers in separating trees into clusters. We also apply our algorithms to a real-world brain artery dataset and show that the previously observed age and sex effects on brain artery structures are revealed better by clustering with our algorithms than with the traditional k-modes and k-means.
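To make the two measures and the two k-means steps concrete, the following is a minimal Python sketch of the unweighted case only. It is an illustration under stated assumptions, not the authors' implementation: each tree is represented as a set of (parent, child) label pairs, every tree shares a root labeled `ROOT`, and each label has the same parent in every tree (our reading of "known node correspondence"). VEO is taken as twice the shared vertices and edges divided by the total number of vertices and edges in both trees, and GED as the number of vertices and edges present in exactly one tree. The names `veo_similarity`, `ged`, `majority_centroid`, and `assign` are hypothetical. The centroid here is a simple majority vote over edges, which minimizes the total unweighted GED in this simplified setting; the weighted cases and the VEO centroid discussed in the paper are not covered.

```python
from collections import Counter
from typing import FrozenSet, List, Sequence, Set, Tuple

Edge = Tuple[str, str]      # (parent label, child label)
Tree = FrozenSet[Edge]      # a tree is identified by its edge set
ROOT = "r"                  # assumed label shared by every tree's root


def vertices(tree: Tree) -> Set[str]:
    """Vertex set implied by the edges; the root is always present."""
    return {ROOT} | {v for edge in tree for v in edge}


def veo_similarity(a: Tree, b: Tree) -> float:
    """Unweighted vertex/edge overlap, a similarity in [0, 1]."""
    va, vb = vertices(a), vertices(b)
    shared = len(va & vb) + len(a & b)
    return 2 * shared / (len(va) + len(vb) + len(a) + len(b))


def ged(a: Tree, b: Tree) -> int:
    """Unweighted graph edit distance under known node correspondence:
    the number of vertices and edges found in exactly one of the trees."""
    return len(vertices(a) ^ vertices(b)) + len(a ^ b)


def majority_centroid(cluster: Sequence[Tree]) -> Tree:
    """Update step (illustrative): keep an edge if it appears in more
    than half of the trees in the cluster."""
    counts = Counter(edge for tree in cluster for edge in tree)
    return frozenset(e for e, c in counts.items() if 2 * c > len(cluster))


def assign(trees: Sequence[Tree], centroids: Sequence[Tree]) -> List[int]:
    """Assignment step: index of the closest centroid (by GED) per tree."""
    return [
        min(range(len(centroids)), key=lambda j: ged(t, centroids[j]))
        for t in trees
    ]


t1 = frozenset({("r", "a"), ("r", "b")})
t2 = frozenset({("r", "a"), ("a", "c")})
t3 = frozenset({("r", "a")})
t4 = frozenset({("r", "b"), ("b", "d")})

print(veo_similarity(t1, t2))          # similarity between t1 and t2
print(ged(t1, t2))                     # edit distance between t1 and t2
print(majority_centroid([t1, t2, t3])) # centroid of a three-tree cluster
print(assign([t1, t2, t3, t4], [t3, t4]))
```

Because a non-root vertex is present exactly when its parent edge is present, an edge kept by the majority vote always has its parent edge kept as well, so the result is a valid rooted tree under the assumptions above. Replacing `ged` with `1 - veo_similarity` in `assign` gives a VEO-based assignment step.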
Acknowledgements
Derya Dinler was partially supported by the Scientific and Technological Research Council of Turkey under Grant 2211.
About this article
Cite this article
Dinler, D., Tural, M.K. & Ozdemirel, N.E. Centroid based Tree-Structured Data Clustering Using Vertex/Edge Overlap and Graph Edit Distance. Ann Oper Res 289, 85–122 (2020). https://doi.org/10.1007/s10479-019-03505-7