Abstract
Graphs are commonly used to express the communication of various data. Faced with uncertain data, we have probabilistic graphs. As a fundamental problem of such graphs, clustering has many applications in analyzing uncertain data. In this paper, we propose a novel method based on ensemble clustering for large probabilistic graphs. To generate ensemble clusters, we develop a set of probable possible worlds of the initial probabilistic graph. Then, we present a probabilistic co-association matrix as a consensus function to integrate base clustering results. It relies on co-occurrences of node pairs based on the probability of the corresponding common cluster graphs. Also, we apply two improvements in the steps before and after of ensembles generation. In the before step, we append neighborhood information based on node features to the initial graph to achieve a more accurate estimation of the probability between the nodes. In the after step, we use supervised metric learning-based Mahalanobis distance to automatically learn a metric from ensemble clusters. It aims to gain crucial features of the base clustering results. We evaluate our work using five real-world datasets and three clustering evaluation metrics, namely the Dunn index, Davies–Bouldin index, and Silhouette coefficient. The results show the impressive performance of clustering large probabilistic graphs.
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-020-03429-1/MediaObjects/11227_2020_3429_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-020-03429-1/MediaObjects/11227_2020_3429_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-020-03429-1/MediaObjects/11227_2020_3429_Fig3_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-020-03429-1/MediaObjects/11227_2020_3429_Fig4_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-020-03429-1/MediaObjects/11227_2020_3429_Fig5_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-020-03429-1/MediaObjects/11227_2020_3429_Fig6_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-020-03429-1/MediaObjects/11227_2020_3429_Fig7_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-020-03429-1/MediaObjects/11227_2020_3429_Fig8_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-020-03429-1/MediaObjects/11227_2020_3429_Fig9_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-020-03429-1/MediaObjects/11227_2020_3429_Fig10_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-020-03429-1/MediaObjects/11227_2020_3429_Fig11_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11227-020-03429-1/MediaObjects/11227_2020_3429_Fig12_HTML.png)
Similar content being viewed by others
References
Zou Z, Li J, Gao H et al (2010) Mining frequent subgraph patterns from uncertain graph data. IEEE Trans Knowl Data Eng 22:1203–1218
Papapetrou O, Ioannou E, Skoutas D (2011) Efficient discovery of frequent subgraph patterns in uncertain graph databases. In: EDBT/ICDT’11, pp 355–366
Potamias M, Bonchi F, Gionis A et al (2010) k-nearest neighbors in uncertain graphs. Proc VLDB Endow 3(1):997–1008
Strehl A, Ghosh J (2003) Cluster ensembles—A knowledge reuse framework for combining partitions. J Mach Learn Res 3:583–617
Topchy A, Jain AK, Punch W (2005) Clustering ensembles: models of consensus and weak partitions. IEEE Trans Pattern Anal Mach Intell 27(12):1866–1881
Li F, Qian Y, Wang J et al (2019) Clustering ensemble based on sample’s stability. Artif Intell 273:37–55
Boongoen T, Iam-On N (2018) Cluster ensembles: a survey of approaches with recent extensions and applications. Comput Sci Rev 28:1–25
Alqurashi T, Wang W (2019) Clustering ensemble method. Int J Mach Learn Cyb 10:1227–1246
Vega-Pons S, Ruiz-Shulcloper J (2011) A survey of clustering ensemble algorithms. Int J Pattern Recogn Artif Intell 25(03):337–372
Kollios G, Potamias M, Terzi E (2013) Clustering large probabilistic graphs. IEEE Trans Knowl Data Eng 25(2):325–336
Ailon N, Charikar M, Newman A (2005) Aggregating Inconsistent Information: Ranking and Clustering. In: Proceedings of the ACM Symposium on Theory of Computing (STOC), pp 684–693
Halim Z, Waqas M, Hussain SF (2015) Clustering large probabilistic graphs using multi-population evolutionary algorithm. Inf Sci 317:78–95
Gu Y, Gao C, Cong G et al (2014) Effective and efficient clustering methods for correlated probabilistic graphs. IEEE Trans Knowl Data Eng 26(5):1117–1130
Ceccarello M, Fantozzi C, Pietracaprina A et al (2017) Clustering uncertain graphs. Proc VLDB Endowment 11(4):472–544
Halim Z, Khattak JH (2019) Density-based clustering of big probabilistic graphs. Evolv Syst 10(3):333–350
Qiu YX, Li RH, Li J, Qiao S et al (2018) Efficient structural clustering on probabilistic graphs. IEEE Trans Knowl Data Eng 31(10):1954–1968
Iam-On N, Boongoen T (2015) Comparative study of matrix refinement approaches for ensemble clustering. Mach Learn 98(1–2):269–300
Huang D, Wang CD, Wu JS et al (2019) Ultra-scalable spectral clustering and ensemble clustering. IEEE TKDE 32(6):1212–1226
Iam-On N, Boongoen T, Garrett S et al (2011) A link-based approach to the cluster ensemble problem. IEEE Trans Pattern Anal Mach Intell 33(12):2396–2409
Yi J, Yang T, Jin R et al (2012) Robust ensemble clustering by matrix completion. In: Proceedings of IEEE International Conference on Data Mining (ICDM)
Fred AN, Jain AK (2005) Combining multiple clusterings using evidence accumulation. IEEE Trans Pattern Anal Mach Intell 27(6):835–850
Lourenço A, Bulò SR, Rebagliati N et al (2015) Probabilistic consensus clustering using evidence accumulation. Mach Learn 98(1–2):331–357
Fern XZ, Brodley CE (2004) Solving cluster ensemble problems by bipartite graph partitioning. In: Proceedings of International Conference on Machine Learning (ICML)
Huang D, Lai JH, Wang CD (2016) Robust ensemble clustering using probability trajectories. IEEE Trans Knowl Data Eng 28(5):1312–1326
Huang D, Wang CD, Lai JH (2018) Locally weighted ensemble clustering. IEEE Trans Cybern 48(5):1460–1473
Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 20(1):359–392
Huang D, Lai J, Wang CD (2016) Ensemble clustering using factor graph. Pattern Recogn 50:131–142
Franek L, Jiang X (2014) Ensemble clustering by means of clustering embedding in vector spaces. Pattern Recogn 47(2):833–842
Weiszfeld E, Plastria F (2009) On the point for which the sum of the distances to n given points is minimum. Ann Oper Res 167(1):7–41
Benjelloun O, Sarma AD, Halevy A et al (2006) ULDBs: databases with uncertainty and lineage. In Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), pp 953–964
Dalvi NN, Suciu D (2004) Efficient Query Evaluation on Probabilistic Databases. In: Proceedings of the 30th International Conference on Very Large Databases, Toronto, Canada.
Gionis A, Mannila H, Tsaparas P (2007) Clustering aggregation. ACM Trans Knowl Discov Data (TKDD) 1(1):1–30
Han K, Gui F, Xiao X et al (2019) Efficient and effective algorithms for clustering uncertain graphs. Proc VLDB Endow 12(6):667–680
Shamir R, Sharan R, Tsur D (2004) Cluster graph modification problems. Discrete Appl Math 144(1–2):173–182
Bian W, Tao D (2011) Learning a distance metric by empirical loss minimization. In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence, pp 1186–1191, Barcelona, Spain
Luo Y, Wen Y, Tao D (2016) On combining side information and unlabeled data for heterogeneous multi-task metric learning. In: The 25th International Joint Conference on Artificial Intelligence, pp 1809–1815, New York
Xiang S, Nie F, Zhang C (2008) Learning a mahalanobis distance metric for data clustering and classification. Pattern Recogn 41(12):3600–3612
Xing EP, Ng AY, Jordan MI et al (2003) Distance metric learning with application to clustering with side-information. In: Proceedings of the 15th International Conference on Neural Information Processing Systems, pp 521–528. Cambridge
Law MT, Yu Y, Cord M et al (2016) Closed-form training of mahalanobis distance for supervised clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3909–3917, Las Vegas, NV
McFee B, Lanckriet GR (2010) Metric learning to rank. In: Proceedings of the 27th International Conference on Machine Learning, pp 775–782, Haifa, Israel.
Mahalanobis PC (1936) On the generalized distance in statistics. In: Proceedings of the National Institute of Science, Calcutta, India
Bellet A, Habrard A, Sebban M (2015) Metric learning. Synthesis lectures on artificial intelligence and machine learning. Morgan & Claypool Publishers, San Rafael
Kulis B (2012) Metric learning: a survey. Found Trends Mach Learn 5(4):287–364
Krogan NJ et al (2006) Global landscape of protein complexes in the yeast saccharomyces cerevisiae. Nature 440(7084):637–643
Wu X, Ma T, Cao J et al (2018) A comparative study of clustering ensemble algorithms. Comput & Electr Eng 68:603–615
Leutbecher M (2018) Ensemble size: how suboptimal is less than infinity? Q J R Meteorol Soc 145:107–128
Buizza R, Palmer TN (1998) Impact of ensemble size on ensemble prediction. Mon Weather Rev 126:2503–2518
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Danesh, M., Dorrigiv, M. & Yaghmaee, F. Ensemble-based clustering of large probabilistic graphs using neighborhood and distance metric learning. J Supercomput 77, 4107–4134 (2021). https://doi.org/10.1007/s11227-020-03429-1
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11227-020-03429-1