Centroid based Tree-Structured Data Clustering Using Vertex/Edge Overlap and Graph Edit Distance

Dinler, Derya; Tural, Mustafa Kemal; Ozdemirel, Nur Evin

doi:10.1007/s10479-019-03505-7

S.I.: OR in Neuroscience II
Published: 01 January 2020

Volume 289, pages 85–122, (2020)
Annals of Operations Research

Abstract

We consider a clustering problem in which the data objects are rooted m-ary trees with known node correspondence. We assume that the nodes of the trees are unweighted, but the edges can be unweighted or weighted. We measure the similarity and the distance between two trees using vertex/edge overlap (VEO) and graph edit distance (GED), respectively. For both measures, we first study the problem of finding a centroid tree of a given cluster of trees in both the unweighted and weighted edge cases. We compute the optimal centroid tree of a given cluster for all measures except the weighted VEO, for which a heuristic is developed. We then propose k-means-based algorithms that repeat cluster assignment and centroid update steps until convergence. The initial centroid trees are constructed based on the properties of the data. The assignment steps use the unweighted or weighted version of VEO or GED to assign each tree to the most similar centroid tree. In the update steps, each centroid tree is updated by considering the trees assigned to it. The proposed algorithms are compared with the traditional k-modes and k-means on randomly generated datasets and are shown to be more effective and more robust to outliers in separating trees into clusters. We also apply our algorithms to a real-world brain artery dataset and show that the previously observed age and sex effects on brain artery structures are revealed better by clustering with our algorithms than with the traditional k-modes and k-means.
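
As a rough illustration of the unweighted case described in the abstract, the sketch below computes VEO and a GED between two trees and runs an assignment/update loop. It rests on several assumptions that are not taken from the article: a tree is stored as a pair (vertex set, edge set) whose shared labels encode the known node correspondence; VEO follows the commonly used definition 2(|V1 ∩ V2| + |E1 ∩ E2|) / (|V1| + |V2| + |E1| + |E2|); GED counts unit-cost insertions and deletions of vertices and edges; and the centroid update is a simple majority vote over vertices and edges, which minimises the summed symmetric-difference distance but is not claimed to be the authors' centroid procedure. The function names (veo, ged, majority_centroid, cluster) are hypothetical.

```python
from collections import Counter

# Assumed representation: tree = (set of vertex labels, set of (parent, child) edges).
# Identical labels in two trees denote corresponding nodes.

def veo(t1, t2):
    """Vertex/edge overlap similarity in [0, 1]; 1 means identical trees."""
    (v1, e1), (v2, e2) = t1, t2
    total = len(v1) + len(v2) + len(e1) + len(e2)
    if total == 0:
        return 1.0  # two empty trees are treated as identical
    return 2.0 * (len(v1 & v2) + len(e1 & e2)) / total

def ged(t1, t2):
    """Unit-cost edit distance: vertices and edges present in exactly one tree."""
    (v1, e1), (v2, e2) = t1, t2
    return len(v1 ^ v2) + len(e1 ^ e2)

def majority_centroid(trees):
    """Keep each vertex/edge that occurs in a strict majority of the trees."""
    n = len(trees)
    v_count = Counter(v for vs, _ in trees for v in vs)
    e_count = Counter(e for _, es in trees for e in es)
    verts = {v for v, c in v_count.items() if 2 * c > n}
    edges = {e for e, c in e_count.items() if 2 * c > n}
    return verts, edges

def cluster(trees, initial_centroids, max_iter=100):
    """k-means-style loop: assign by smallest GED, then update by majority vote."""
    centroids = list(initial_centroids)  # copy so the caller's list is untouched
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: ged(t, centroids[k]))
            for t in trees
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [t for t, lab in zip(trees, labels) if lab == k]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[k] = majority_centroid(members)
    return labels, centroids
```

Assigning by largest veo instead of smallest ged gives the similarity-based variant of the same loop. The weighted-edge cases and the data-driven initial centroids mentioned in the abstract are not covered by this sketch.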

Acknowledgements

Derya Dinler was partially supported by the Scientific and Technological Research Council of Turkey under Grant 2211.

Author information

Authors and Affiliations

Department of Industrial Engineering, Hacettepe University, Ankara, Turkey
Derya Dinler
Department of Industrial Engineering, Middle East Technical University, Ankara, Turkey
Mustafa Kemal Tural & Nur Evin Ozdemirel

Corresponding author

Correspondence to Derya Dinler.

About this article

Cite this article

Dinler, D., Tural, M.K. & Ozdemirel, N.E. Centroid based Tree-Structured Data Clustering Using Vertex/Edge Overlap and Graph Edit Distance. Ann Oper Res 289, 85–122 (2020). https://doi.org/10.1007/s10479-019-03505-7

Issue Date: June 2020
DOI: https://doi.org/10.1007/s10479-019-03505-7
