Centroid based Tree-Structured Data Clustering Using Vertex/Edge Overlap and Graph Edit Distance

  • S.I.: OR in Neuroscience II
Annals of Operations Research

Abstract

We consider a clustering problem in which the data objects are rooted m-ary trees with known node correspondence. The nodes of the trees are unweighted, while the edges may be unweighted or weighted. We measure similarity between two trees with vertex/edge overlap (VEO) and distance with graph edit distance (GED). For each measure, we first study the problem of finding a centroid tree of a given cluster of trees, in both the unweighted and weighted edge cases. We compute the optimal centroid tree for every measure except weighted VEO, for which we develop a heuristic. We then propose k-means-based algorithms that repeat cluster assignment and centroid update steps until convergence. The initial centroid trees are constructed from properties of the data. Each assignment step uses the unweighted or weighted version of VEO or GED to assign every tree to its most similar centroid tree, and each update step recomputes every centroid tree from the trees assigned to it. On randomly generated datasets, the proposed algorithms are more effective than traditional k-modes and k-means at separating trees into clusters and more robust to outliers. We also apply our algorithms to a real-world brain artery dataset and show that they reveal previously observed age and sex effects on brain artery structure more clearly than traditional k-modes and k-means.
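
As a rough illustration of the unweighted measures named in the abstract, the sketch below computes VEO and GED for trees with known node correspondence, plus one assignment and one update step. It rests on assumptions not stated on this page: each tree is stored as a pair (set of node ids, set of (parent, child) edges); VEO and GED follow the commonly used set-overlap and symmetric-difference forms; and the update step is a plain majority vote, which minimizes the total symmetric difference but is not necessarily the centroid construction used in the article. All function names are hypothetical.

```python
from collections import Counter

# A tree is a pair (nodes, edges): a frozenset of node ids and a frozenset of
# (parent, child) pairs. Node ids are shared across trees (known correspondence).

def veo(a, b):
    """Vertex/edge overlap similarity in [0, 1] (assumed set-overlap form)."""
    shared = len(a[0] & b[0]) + len(a[1] & b[1])
    total = len(a[0]) + len(b[0]) + len(a[1]) + len(b[1])
    return 2 * shared / total if total else 1.0

def ged(a, b):
    """Graph edit distance with known correspondence: the number of nodes and
    edges present in exactly one of the two trees (assumed unit edit costs)."""
    return len(a[0] ^ b[0]) + len(a[1] ^ b[1])

def assign(trees, centroids):
    """Assignment step: index of the GED-closest centroid for each tree."""
    return [min(range(len(centroids)), key=lambda k: ged(t, centroids[k]))
            for t in trees]

def majority_centroid(trees):
    """Update step (illustrative): keep nodes and edges found in more than
    half of the trees in the cluster."""
    n = len(trees)
    node_counts = Counter(v for nodes, _ in trees for v in nodes)
    edge_counts = Counter(e for _, edges in trees for e in edges)
    nodes = frozenset(v for v, c in node_counts.items() if 2 * c > n)
    edges = frozenset(e for e, c in edge_counts.items() if 2 * c > n)
    return nodes, edges

# Three small trees rooted at node 1.
A = (frozenset({1, 2, 3}), frozenset({(1, 2), (1, 3)}))
B = (frozenset({1, 2, 4}), frozenset({(1, 2), (2, 4)}))
C = (frozenset({1, 2, 3}), frozenset({(1, 2), (1, 3)}))

print(veo(A, B))                     # 0.6
print(ged(A, B))                     # 4
print(majority_centroid([A, B, C]))  # same node and edge sets as A
print(assign([A, B, C], [A, B]))     # [0, 1, 0]
```

In a rooted tree, every node that appears in a majority of the trees brings its parent along in those same trees, so the majority vote here returns a connected tree. The weighted edge cases described in the abstract would need edge weights in the representation and are not covered by this sketch.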

Acknowledgements

Derya Dinler was partially supported by the Scientific and Technological Research Council of Turkey under Grant 2211.

Author information

Corresponding author

Correspondence to Derya Dinler.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Dinler, D., Tural, M.K. & Ozdemirel, N.E. Centroid based Tree-Structured Data Clustering Using Vertex/Edge Overlap and Graph Edit Distance. Ann Oper Res 289, 85–122 (2020). https://doi.org/10.1007/s10479-019-03505-7

  • DOI: https://doi.org/10.1007/s10479-019-03505-7
