Abstract
Unsupervised cluster matching is the task of finding a matching between clusters of objects in different domains. Examples include matching word clusters in different languages without dictionaries or parallel sentences, and matching user communities across different friendship networks. Existing methods assume that every object is assigned to a cluster. In real-world applications, however, some objects do not form clusters. These irrelevant objects degrade cluster matching performance because incorrectly estimated matches affect the estimated matches of other objects. In this paper, we propose a probabilistic model for robust unsupervised cluster matching that, given multiple networks, simultaneously discovers the relevance of objects and the matching of object clusters. The proposed method finds correspondences only for relevant objects and keeps irrelevant objects unmatched, which improves matching performance by eliminating the adverse impact of irrelevant objects. With the proposed method, relevant objects in different networks are clustered into a shared set of clusters by assuming that the different networks are generated from a common probabilistic network model, which is an extension of stochastic block models. Objects assigned to the same cluster are considered matched. Edges for irrelevant objects are assumed to be generated from a noise distribution irrespective of cluster assignments. We present an efficient Bayesian inference procedure for the proposed model based on collapsed Gibbs sampling. In our experiments, we demonstrate the effectiveness of the proposed method using synthetic and real-world data sets, including multilingual corpora and movie ratings.
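The generative assumptions described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes undirected binary edges drawn from Bernoulli distributions, a fixed number of clusters, and a hand-picked block matrix; the names `sample_networks`, `matched_objects`, `relevance_prob`, and `noise_prob` are hypothetical. It shows only the forward (data-generating) direction; the collapsed Gibbs sampler that infers relevance and cluster assignments from observed networks is not covered here.

```python
import random


def sample_networks(num_networks, num_objects, num_clusters,
                    relevance_prob=0.7, noise_prob=0.05, seed=0):
    """Sample networks that share one block structure (illustrative only)."""
    rng = random.Random(seed)
    # Assumed shared block matrix: dense within clusters, sparse between them.
    eta = [[0.8 if k == l else 0.05 for l in range(num_clusters)]
           for k in range(num_clusters)]
    networks = []
    for _ in range(num_networks):
        # z[i] is the cluster of object i, or None if the object is irrelevant.
        z = [rng.randrange(num_clusters) if rng.random() < relevance_prob else None
             for _ in range(num_objects)]
        adj = [[0] * num_objects for _ in range(num_objects)]
        for i in range(num_objects):
            for j in range(i + 1, num_objects):
                if z[i] is not None and z[j] is not None:
                    # Both endpoints relevant: edge follows the shared block model.
                    p = eta[z[i]][z[j]]
                else:
                    # An irrelevant endpoint: edge follows the noise distribution,
                    # irrespective of any cluster assignment.
                    p = noise_prob
                adj[i][j] = adj[j][i] = int(rng.random() < p)
        networks.append((z, adj))
    return networks


def matched_objects(networks, cluster):
    """Per network, list the objects assigned to the given shared cluster."""
    return [[i for i, k in enumerate(z) if k == cluster] for z, _ in networks]
```

For example, `matched_objects(sample_networks(2, 20, 3), 0)` returns, for each of the two networks, the indices of the objects in shared cluster 0. Those objects are treated as matched across networks, while objects marked `None` never appear in any match.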
Notes
Available at http://www.cs.nyu.edu/~roweis/data.html.
Available at http://ai.stanford.edu/~gal/.
Additional information
Responsible editor: Jian Pei.
About this article
Cite this article
Iwata, T., Ishiguro, K. Robust unsupervised cluster matching for network data. Data Min Knowl Disc 31, 1132–1154 (2017). https://doi.org/10.1007/s10618-017-0509-y
DOI: https://doi.org/10.1007/s10618-017-0509-y