Pairwise Similarity for Cluster Ensemble Problem: Link-Based and Approximate Approaches

  • Chapter
Transactions on Large-Scale Data- and Knowledge-Centered Systems IX

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 7980)

Abstract

Cluster ensemble methods have emerged as powerful techniques that aggregate several input data clusterings to generate a single output clustering with improved robustness and stability. In particular, link-based similarity techniques have recently been introduced, with superior performance to the conventional co-association method. Their potential and applicability are, however, limited by their underlying time complexity. In light of this shortcoming, this paper presents two approximate approaches that mitigate the problem of time complexity: the approximate algorithm approach (Approximate SimRank Based Similarity matrix) and the approximate data approach (Prototype-based cluster ensemble model). The first approach decreases the computational requirement of the existing link-based technique; the second reduces the size of the problem by finding a smaller, representative, approximate dataset, derived by a density-biased sampling technique. The advantages of both approximate approaches are empirically demonstrated over 22 datasets (both artificial and real data), with statistical comparisons of performance (at the 95% confidence level) using three well-known validity criteria. Results obtained from these experiments suggest that approximate techniques can efficiently help scale up the application of link-based similarity methods to a wider range of data sizes.
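For readers unfamiliar with the conventional co-association method that the abstract uses as its baseline, the following is a minimal sketch rather than the chapter's own implementation. It assumes the usual definition, in which the similarity of two data points is the fraction of base clusterings that place them in the same cluster. The function name `co_association` and the toy labelings in `base` are illustrative assumptions and do not come from the chapter; the link-based and approximate methods it proposes are not shown here.

```python
import numpy as np

def co_association(labelings):
    """Pairwise similarity as the fraction of base clusterings
    in which two points share a cluster (assumed definition)."""
    labelings = np.asarray(labelings)      # shape: (M clusterings, N points)
    m, n = labelings.shape
    ca = np.zeros((n, n), dtype=float)
    for labels in labelings:
        # Boolean N x N matrix: True where points i and j have the same label
        ca += labels[:, None] == labels[None, :]
    return ca / m

# Hypothetical ensemble: three base clusterings of four data points.
base = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
ca = co_association(base)
print(ca)
```

The resulting N x N matrix can be passed to any similarity-based clustering algorithm to produce the consensus partition. Building it costs time and memory quadratic in the number of data points, which is one reason a reduced, representative dataset of the kind the abstract describes can be attractive for larger data.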

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Iam-On, N., Boongoen, T. (2013). Pairwise Similarity for Cluster Ensemble Problem: Link-Based and Approximate Approaches. In: Hameurlain, A., Küng, J., Wagner, R. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems IX. Lecture Notes in Computer Science, vol 7980. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40069-8_5

  • DOI: https://doi.org/10.1007/978-3-642-40069-8_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40068-1

  • Online ISBN: 978-3-642-40069-8

  • eBook Packages: Computer Science, Computer Science (R0)
