Robust unsupervised cluster matching for network data

Iwata, Tomoharu; Ishiguro, Katsuhiko

doi:10.1007/s10618-017-0509-y

Published: 10 May 2017

Volume 31, pages 1132–1154, (2017)

Data Mining and Knowledge Discovery

Tomoharu Iwata¹ & Katsuhiko Ishiguro¹,²

Abstract

Unsupervised cluster matching is the task of finding a matching between clusters of objects in different domains. Examples include matching word clusters in different languages without dictionaries or parallel sentences, and matching user communities across different friendship networks. Existing methods assume that every object is assigned to a cluster. However, in real-world applications, some objects do not form clusters. These irrelevant objects degrade cluster matching performance, because a mistakenly estimated matching affects the estimated matching of other objects. In this paper, we propose a probabilistic model for robust unsupervised cluster matching that simultaneously discovers the relevance of objects and the matching of object clusters, given multiple networks. The proposed method finds correspondences only for relevant objects and keeps irrelevant objects unmatched, which improves matching performance by eliminating the adverse impact of irrelevant objects. With the proposed method, relevant objects in different networks are clustered into a shared set of clusters by assuming that the different networks are generated from a common probabilistic network model, which is an extension of stochastic block models. Objects assigned to the same cluster are considered matched. Edges for irrelevant objects are assumed to be generated from a noise distribution irrespective of cluster assignments. We present an efficient Bayesian inference procedure for the proposed model based on collapsed Gibbs sampling. In our experiments, we demonstrate the effectiveness of the proposed method using synthetic and real-world data sets, including multilingual corpora and movie ratings.
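The generative idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' model or code: the function name `sample_networks`, the fixed number of clusters, the Beta prior on block probabilities, the Bernoulli noise distribution, the directed edges without self-loops, and the rule that any pair touching an irrelevant object uses the noise probability are all assumptions made for illustration. The paper's actual priors and inference procedure (collapsed Gibbs sampling) are not reproduced here.

```python
import numpy as np


def sample_networks(n_objects, n_clusters, p_relevant, noise_prob,
                    n_networks=2, seed=0):
    """Sample networks from a shared block model with irrelevant objects.

    Hypothetical illustration only; all distributional choices are assumptions.
    """
    rng = np.random.default_rng(seed)
    # Cluster-to-cluster edge probabilities, shared by every network.
    block_probs = rng.beta(0.5, 0.5, size=(n_clusters, n_clusters))
    networks = []
    for _ in range(n_networks):
        # Each object is independently relevant or irrelevant.
        relevant = rng.random(n_objects) < p_relevant
        # Cluster indices refer to the shared set of clusters.
        clusters = rng.integers(n_clusters, size=n_objects)
        # Edge probability of every ordered pair under the shared block model.
        probs = block_probs[clusters[:, None], clusters[None, :]]
        # Pairs involving an irrelevant object ignore cluster assignments
        # and use the noise probability instead (an assumption).
        both_relevant = relevant[:, None] & relevant[None, :]
        probs = np.where(both_relevant, probs, noise_prob)
        adjacency = (rng.random((n_objects, n_objects)) < probs).astype(int)
        np.fill_diagonal(adjacency, 0)  # no self-loops (an assumption)
        networks.append({"adjacency": adjacency,
                         "clusters": clusters,
                         "relevant": relevant})
    return block_probs, networks


def matched_objects(networks, cluster):
    """Relevant objects of each network that fall in the given shared cluster."""
    return [np.flatnonzero(net["relevant"] & (net["clusters"] == cluster))
            for net in networks]
```

In this sketch, relevant objects from different networks that share a cluster index are treated as matched, while irrelevant objects never appear in the output of `matched_objects`. Inference would run in the opposite direction, estimating `clusters` and `relevant` from the observed adjacency matrices.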

Notes

Available at http://www.cs.nyu.edu/~roweis/data.html.
Available at http://ai.stanford.edu/~gal/.

Author information

Authors and Affiliations

NTT Communication Science Laboratories, 2-4 Hikaridai, Seikacho, Sorakugun, Kyoto, 619-0237, Japan
Tomoharu Iwata & Katsuhiko Ishiguro
Mirai Translate, Inc., Tokyo, Japan
Katsuhiko Ishiguro

Corresponding author

Correspondence to Tomoharu Iwata.

Additional information

Responsible editor: Jian Pei.

About this article

Cite this article

Iwata, T., Ishiguro, K. Robust unsupervised cluster matching for network data. Data Min Knowl Disc 31, 1132–1154 (2017). https://doi.org/10.1007/s10618-017-0509-y

Received: 09 August 2016
Accepted: 26 April 2017
Published: 10 May 2017
Issue Date: July 2017
DOI: https://doi.org/10.1007/s10618-017-0509-y
