S-SimRank: Combining Content and Link Information to Cluster Papers Effectively and Efficiently

  • Conference paper
Advanced Data Mining and Applications (ADMA 2008)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5139)

Abstract

Content analysis and link analysis each have their own advantages in measuring relationships among documents. In this paper, we propose a new method that combines the two to compute the similarity of research papers, so that the papers can be clustered more accurately. To improve the efficiency of similarity calculation, we develop a strategy that handles the relationship graph separately without affecting accuracy. We also design an approach that assigns different weights to different links to the papers, which improves the accuracy of similarity calculation. Experiments conducted on the ACM data set show that our new algorithm, S-SimRank, outperforms other algorithms.
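As a rough illustration of the two ingredients named in the abstract, the sketch below computes a link-based score with the basic SimRank recurrence (the well-known baseline by Jeh and Widom, not S-SimRank itself) and a content-based score with cosine similarity over term weights. The abstract does not say how S-SimRank combines the two, how it splits the relationship graph, or how it weights links, so the `blended` function, its `alpha` parameter, the decay value, and the toy data are all assumptions made only for this example.

```python
import math
from itertools import product


def simrank(in_neighbors, decay=0.8, iterations=10):
    """Basic SimRank: two nodes are similar if their in-neighbors are similar.

    in_neighbors maps each paper to the list of papers that cite it.
    """
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new[(a, b)] = 1.0
                continue
            ia, ib = in_neighbors[a], in_neighbors[b]
            if not ia or not ib:
                # A node with no in-neighbors has similarity 0 to any other node.
                new[(a, b)] = 0.0
                continue
            total = sum(sim[(x, y)] for x in ia for y in ib)
            new[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def blended(link_sim, text_sim, alpha=0.5):
    """Hypothetical linear blend; NOT the combination rule from the paper."""
    return alpha * text_sim + (1.0 - alpha) * link_sim


# Toy citation graph (assumed data): p1 cites both p2 and p3.
citations = {"p1": [], "p2": ["p1"], "p3": ["p1"]}
link_scores = simrank(citations)

# Toy term weights (assumed data).
terms = {
    "p2": {"graph": 1.0, "cluster": 1.0},
    "p3": {"graph": 1.0},
}
text_score = cosine(terms["p2"], terms["p3"])
combined = blended(link_scores[("p2", "p3")], text_score)
```

Here p2 and p3 share their only citing paper, so the link score between them equals the decay factor, while the content score reflects their single shared term. A clustering step would then operate on the combined pairwise scores.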

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cai, Y., Li, P., Liu, H., He, J., Du, X. (2008). S-SimRank: Combining Content and Link Information to Cluster Papers Effectively and Efficiently. In: Tang, C., Ling, C.X., Zhou, X., Cercone, N.J., Li, X. (eds) Advanced Data Mining and Applications. ADMA 2008. Lecture Notes in Computer Science, vol 5139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88192-6_30

  • DOI: https://doi.org/10.1007/978-3-540-88192-6_30

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-88191-9

  • Online ISBN: 978-3-540-88192-6

  • eBook Packages: Computer Science, Computer Science (R0)
