Scalable Link-Based Similarity Computation and Clustering

Link Mining: Models, Algorithms, and Applications

Abstract

Data objects in a relational database are cross-linked with each other via multi-typed links. Links contain rich semantic information that may indicate important relationships among objects, such as the similarities between objects. In this chapter we explore linkage-based clustering, in which the similarity between two objects is measured based on the similarities between the objects linked with them. We study a hierarchical structure called SimTree, which represents similarities in a multi-granularity manner. This method avoids the high cost of computing and storing pairwise similarities but still thoroughly explores the relationships among objects. We introduce an efficient algorithm for computing similarities utilizing the SimTree.
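
The recursive idea in the abstract, that two objects are similar when the objects linked with them are similar, is also the idea behind SimRank (Jeh and Widom, KDD 2002). As an illustration only, the following is a minimal sketch of naive SimRank iteration on a tiny author-conference graph. It is not the chapter's SimTree algorithm; the graph, the decay constant 0.8, the iteration count, and the function names `build_neighbors` and `simrank` are all assumptions made for the example. The sketch keeps a score for every pair of objects, which is the pairwise cost the abstract says SimTree is designed to avoid.

```python
from itertools import product


def build_neighbors(edges):
    """Turn an undirected edge list into {node: set of linked nodes}."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return neighbors


def simrank(neighbors, decay=0.8, iterations=10):
    """Naive SimRank: the score of (a, b) is the decayed average score
    over all pairs of objects linked with a and with b."""
    nodes = list(neighbors)
    # Start with s(a, a) = 1 and s(a, b) = 0 for a != b.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new_sim[(a, b)] = 1.0
            elif not neighbors[a] or not neighbors[b]:
                new_sim[(a, b)] = 0.0
            else:
                total = sum(sim[(x, y)]
                            for x in neighbors[a] for y in neighbors[b])
                new_sim[(a, b)] = decay * total / (
                    len(neighbors[a]) * len(neighbors[b]))
        sim = new_sim
    return sim


# Hypothetical data: three authors linked to three conferences.
edges = [
    ("author_1", "conf_A"), ("author_1", "conf_B"),
    ("author_2", "conf_A"), ("author_2", "conf_B"),
    ("author_3", "conf_B"), ("author_3", "conf_C"),
]
scores = simrank(build_neighbors(edges))
print(scores[("author_1", "author_2")], scores[("author_1", "author_3")])
```

In this toy graph, author_1 and author_2 publish in exactly the same conferences, so their score ends up higher than that of author_1 and author_3, who share only one. The dictionary holds one entry per ordered pair of nodes, so memory grows quadratically with the number of objects.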

Notes

  1. Here conferences refer to conferences, journals, and workshops. We are only interested in productive authors and well-known conferences because it is easier to determine the research fields related to each of them, from which the accuracy of clustering will be judged.

  2. Since no frequent patterns of conferences can be found using the proceedings linked to them, LinkClus uses authors linked with conferences to find frequent patterns of conferences, in order to build the initial SimTree for conferences.

  3. We do not test SimRank and F-SimRank on large databases because they consume too much memory.

References

  1. C. C. Aggarwal, C. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast algorithms for projected clustering. In SIGMOD, Philadelphia, PA, 1999.

  2. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD, Washington, DC, 1993.

  3. Y. Bartal. On approximating arbitrary metrics by tree metrics. In STOC, Dallas, TX, 1998.

  4. R. Bekkerman, R. El-Yaniv, and A. McCallum. Multi-way distributional clustering via pairwise interactions. In ICML, Bonn, Germany, 2005.

  5. D. Chakrabarti, S. Papadimitriou, D. S. Modha, and C. Faloutsos. Fully automatic cross-associations. In KDD, Seattle, WA, 2004.

  6. Y. Cheng and G. M. Church. Biclustering of expression data. In ISMB, La Jolla, CA, 2000.

  7. DBLP Bibliography. www.informatik.uni-trier.de/~ley/db/

  8. I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In KDD, Washington, DC, 2003.

  9. M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the Internet topology. In SIGCOMM, Cambridge, MA, 1999.

  10. D. Fogaras and B. Rácz. Scaling link-based similarity search. In WWW, Chiba, Japan, 2005.

  11. S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In SIGMOD, Seattle, WA, 1998.

  12. J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining top-k frequent closed patterns without minimum support. In ICDM, Maebashi City, Japan, 2002.

  13. G. Jeh and J. Widom. SimRank: A measure of structural-context similarity. In KDD, Edmonton, Canada, 2002.

  14. M. Kirsten and S. Wrobel. Relational distance-based clustering. In ILP, Madison, WI, 1998.

  15. J. MacQueen. Some methods for classification and analysis of multivariate observations. In Berkeley Symposium, Berkeley, CA, 1967.

  16. R. T. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In VLDB, Santiago de Chile, Chile, 1994.

  17. R. Sibson. SLINK: An optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1):30–34, 1973.

  18. P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to data mining. Addison-Wesley, New York, NY, 2005.

  19. J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In KDD, Washington, DC, 2003.

  20. J. D. Wang, H. J. Zeng, Z. Chen, H. J. Lu, L. Tao, and W. Y. Ma. ReCoM: Reinforcement clustering of multi-type interrelated data objects. In SIGIR, Toronto, Canada, 2003.

  21. X. Yin, J. Han, and P. S. Yu. Cross-relational clustering with user’s guidance. In KDD, Chicago, IL, 2005.

  22. T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In SIGMOD, Montreal, Canada, 1996.

Corresponding author

Correspondence to Xiaoxin Yin.

Copyright information

© 2010 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Yin, X., Han, J., Yu, P.S. (2010). Scalable Link-Based Similarity Computation and Clustering. In: Yu, P., Han, J., Faloutsos, C. (eds) Link Mining: Models, Algorithms, and Applications. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-6515-8_2
