
Robust unsupervised cluster matching for network data

Data Mining and Knowledge Discovery

Abstract

Unsupervised cluster matching is the task of finding correspondences between clusters of objects in different domains. Examples include matching word clusters in different languages without dictionaries or parallel sentences, and matching user communities across different friendship networks. Existing methods assume that every object is assigned to a cluster. In real-world applications, however, some objects do not belong to any cluster. These irrelevant objects degrade cluster matching performance, because a mistakenly estimated match affects the estimated matches of other objects. In this paper, we propose a probabilistic model for robust unsupervised cluster matching that simultaneously discovers the relevance of objects and the matching of object clusters, given multiple networks. The proposed method finds correspondences only for relevant objects and keeps irrelevant objects unmatched, which improves matching performance by eliminating the adverse impact of irrelevant objects. With the proposed method, relevant objects in different networks are clustered into a shared set of clusters by assuming that the networks are generated from a common probabilistic network model, which is an extension of stochastic block models. Objects assigned to the same cluster are considered matched. Edges for irrelevant objects are assumed to be generated from a noise distribution irrespective of cluster assignments. We present an efficient Bayesian inference procedure for the proposed model based on collapsed Gibbs sampling. In our experiments, we demonstrate the effectiveness of the proposed method using synthetic and real-world data sets, including multilingual corpora and movie ratings.
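The generative assumptions stated in the abstract can be illustrated with a minimal sketch. This is not the authors' model or inference code: the fixed number of clusters, the Bernoulli relevance indicator, the Beta prior on block probabilities, the directed binary edges, and all names (`generate_networks`, `matched_pairs`, `relevance_prob`, `noise_edge_prob`) are assumptions made only for illustration. The sketch samples several networks from one shared block structure, draws every edge that touches an irrelevant object from a single noise probability, and treats relevant objects in the same cluster as matched.

```python
import random


def generate_networks(num_networks=2, num_objects=30, num_clusters=3,
                      relevance_prob=0.8, noise_edge_prob=0.1, seed=0):
    """Sample networks that share one cluster-to-cluster edge probability table."""
    rng = random.Random(seed)
    # Shared block probabilities: common to all networks (assumed Beta prior).
    block = [[rng.betavariate(0.5, 0.5) for _ in range(num_clusters)]
             for _ in range(num_clusters)]
    networks = []
    for _ in range(num_networks):
        # Each object is relevant or irrelevant (assumed Bernoulli indicator).
        relevant = [rng.random() < relevance_prob for _ in range(num_objects)]
        # Only relevant objects receive a cluster; irrelevant ones get None.
        cluster = [rng.randrange(num_clusters) if r else None for r in relevant]
        adj = [[0] * num_objects for _ in range(num_objects)]
        for i in range(num_objects):
            for j in range(num_objects):
                if i == j:
                    continue  # no self-loops in this sketch
                if relevant[i] and relevant[j]:
                    # Edge between relevant objects depends on their clusters.
                    p = block[cluster[i]][cluster[j]]
                else:
                    # Any edge touching an irrelevant object comes from noise,
                    # irrespective of cluster assignments.
                    p = noise_edge_prob
                adj[i][j] = int(rng.random() < p)
        networks.append({"relevant": relevant, "cluster": cluster, "adj": adj})
    return block, networks


def matched_pairs(net_a, net_b):
    """Pairs (i, j) of relevant objects assigned to the same shared cluster."""
    return [(i, j)
            for i, ci in enumerate(net_a["cluster"])
            for j, cj in enumerate(net_b["cluster"])
            if ci is not None and ci == cj]


if __name__ == "__main__":
    _, nets = generate_networks()
    pairs = matched_pairs(nets[0], nets[1])
    print(len(pairs), "cross-network matches among relevant objects")
```

In this sketch the cluster assignments are known because they were sampled; in the setting the abstract describes they are latent and would have to be inferred, for example by collapsed Gibbs sampling, which this example does not implement.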


Notes

  1. Available at http://www.cs.nyu.edu/~roweis/data.html.

  2. Available at http://ai.stanford.edu/~gal/.


Author information

Corresponding author

Correspondence to Tomoharu Iwata.

Additional information

Responsible editor: Jian Pei.

About this article

Cite this article

Iwata, T., Ishiguro, K. Robust unsupervised cluster matching for network data. Data Min Knowl Disc 31, 1132–1154 (2017). https://doi.org/10.1007/s10618-017-0509-y

  • DOI: https://doi.org/10.1007/s10618-017-0509-y
