Abstract
Content analysis and link analysis each have their advantages in measuring relationships among documents. In this paper, we propose a new method that combines the two to compute the similarity of research papers, so that these papers can be clustered more accurately. To improve the efficiency of similarity calculation, we develop a strategy that processes the relationship graph separately without affecting accuracy. We also design an approach that assigns different weights to different links between papers, which improves the accuracy of similarity calculation. Experiments conducted on the ACM data set show that our new algorithm, S-SimRank, outperforms other algorithms.
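The general idea of combining a content score with a link score can be illustrated with a minimal sketch. It is not the S-SimRank algorithm itself, whose details the abstract does not give. The sketch assumes plain SimRank (Jeh and Widom) for the link side, cosine similarity over raw term counts for the content side, and a simple linear blend. The function names, the `alpha` blend parameter, and the toy data are all hypothetical, and neither the link weighting nor the graph strategy mentioned above is modelled.

```python
import math
from collections import Counter
from itertools import product


def cosine_similarity(text_a, text_b):
    """Content side: cosine similarity of raw term-frequency vectors."""
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def simrank(in_links, decay=0.8, iterations=10):
    """Link side: plain SimRank. Two papers are similar if the papers
    citing them are similar. in_links maps a paper to its citing papers."""
    nodes = list(in_links)
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new[(a, b)] = 1.0
            elif not in_links[a] or not in_links[b]:
                new[(a, b)] = 0.0  # no citing papers, no structural evidence
            else:
                total = sum(sim[(i, j)] for i in in_links[a] for j in in_links[b])
                new[(a, b)] = decay * total / (len(in_links[a]) * len(in_links[b]))
        sim = new
    return sim


def combined_similarity(texts, in_links, alpha=0.5):
    """Hypothetical blend: alpha * content score + (1 - alpha) * link score."""
    link_sim = simrank(in_links)
    return {
        (a, b): alpha * cosine_similarity(texts[a], texts[b]) + (1 - alpha) * score
        for (a, b), score in link_sim.items()
    }


# Toy data (invented for illustration): p3 cites both p1 and p2.
texts = {
    "p1": "graph clustering",
    "p2": "graph clustering",
    "p3": "database indexing",
}
in_links = {"p1": ["p3"], "p2": ["p3"], "p3": []}
scores = combined_similarity(texts, in_links)
print(round(scores[("p1", "p2")], 3))
```

In this toy example p1 and p2 share both their vocabulary and their only citing paper, so they score high on both sides, while p3 shares neither with them. The resulting pairwise scores could then be fed to any standard clustering routine.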
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cai, Y., Li, P., Liu, H., He, J., Du, X. (2008). S-SimRank: Combining Content and Link Information to Cluster Papers Effectively and Efficiently. In: Tang, C., Ling, C.X., Zhou, X., Cercone, N.J., Li, X. (eds) Advanced Data Mining and Applications. ADMA 2008. Lecture Notes in Computer Science, vol 5139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88192-6_30
DOI: https://doi.org/10.1007/978-3-540-88192-6_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88191-9
Online ISBN: 978-3-540-88192-6
eBook Packages: Computer Science