S-SimRank: Combining Content and Link Information to Cluster Papers Effectively and Efficiently

Cai, Yuanzhe; Li, Pei; Liu, Hongyan; He, Jun; Du, Xiaoyong

doi:10.1007/978-3-540-88192-6_30

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5139)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

Content analysis and link analysis each have their own advantages in measuring relationships among documents. In this paper, we propose a new method that combines the two to compute the similarity of research papers, so that these papers can be clustered more accurately. To improve the efficiency of similarity calculation, we develop a strategy that deals with the relationship graph separately without affecting accuracy. We also design an approach that assigns different weights to different links to the papers, which enhances the accuracy of similarity calculation. Experimental results on the ACM data set show that our new algorithm, S-SimRank, outperforms other algorithms.
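To make the general idea of mixing content-based and link-based similarity concrete, the following is a minimal sketch only. It is not the S-SimRank algorithm, whose details are not given in this abstract. It pairs cosine similarity over term weights with plain SimRank over citation in-links and blends the two with a linear weight. The function names, the `alpha` parameter, the decay factor `c`, and the toy data are all assumptions made for illustration.

```python
import math
from itertools import product


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def simrank(in_links, c=0.8, iterations=5):
    """Plain SimRank over a dict mapping each node to its in-neighbours.

    Every node that appears as an in-neighbour must also be a key.
    """
    nodes = list(in_links)
    sim = {(u, v): 1.0 if u == v else 0.0 for u, v in product(nodes, nodes)}
    for _ in range(iterations):
        new = {}
        for u, v in product(nodes, nodes):
            if u == v:
                new[(u, v)] = 1.0
                continue
            iu, iv = in_links[u], in_links[v]
            if not iu or not iv:
                # A node with no in-links has similarity 0 to any other node.
                new[(u, v)] = 0.0
                continue
            total = sum(sim[(a, b)] for a in iu for b in iv)
            new[(u, v)] = c * total / (len(iu) * len(iv))
        sim = new
    return sim


def combined(content_sim, link_sim, alpha=0.5):
    """Hypothetical linear blend; alpha is an illustrative assumption."""
    return alpha * content_sim + (1.0 - alpha) * link_sim


# Toy data (invented): p3 cites both p1 and p2.
in_links = {"p1": ["p3"], "p2": ["p3"], "p3": []}
terms = {
    "p1": {"graph": 1.0, "cluster": 1.0},
    "p2": {"graph": 1.0},
}

link_scores = simrank(in_links)
content_score = cosine(terms["p1"], terms["p2"])
score = combined(content_score, link_scores[("p1", "p2")])
```

In this toy graph, `p1` and `p2` share their only citing paper, so their link similarity is driven entirely by the decay factor `c`, while the content score reflects their one shared term. How S-SimRank actually weights links and partitions the graph is described in the full paper, not here.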

Author information

Authors and Affiliations

Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China
Yuanzhe Cai, Pei Li, Jun He & Xiaoyong Du
Department of Computer Science, Renmin University of China, China
Yuanzhe Cai, Pei Li, Jun He & Xiaoyong Du
Department of Management Science and Engineering, Tsinghua University, China
Hongyan Liu

Editor information

Editors and Affiliations

School of Computer Science, Sichuan University, 610065, Chengdu, China
Changjie Tang
Department of Computer Science, The University of Western Ontario, Canada
Charles X. Ling
School of ITEE, The University of Queensland, Australia
Xiaofang Zhou
Faculty of Science & Engineering, York University, 355 Lumbers Building, M3J 1P3, Toronto, Ontario, Canada
Nick J. Cercone
School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, 4072, Queensland, Australia
Xue Li


Cite this paper

Cai, Y., Li, P., Liu, H., He, J., Du, X. (2008). S-SimRank: Combining Content and Link Information to Cluster Papers Effectively and Efficiently. In: Tang, C., Ling, C.X., Zhou, X., Cercone, N.J., Li, X. (eds) Advanced Data Mining and Applications. ADMA 2008. Lecture Notes in Computer Science, vol 5139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88192-6_30

DOI: https://doi.org/10.1007/978-3-540-88192-6_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88191-9
Online ISBN: 978-3-540-88192-6
eBook Packages: Computer Science, Computer Science (R0)
