Abstract
Several iterative hyperlink-based similarity measures were published to express the similarity of web pages. However, it usually seems hopeless to evaluate complex similarity functions over large repositories containing hundreds of millions of pages.We introduce scalable algorithms computing SimRank scores, which express the contextual similarities of pages based on the hyperlink structure. The proposed methods scale well to large repositories, fulfilling strict requirements about computational complexity. The algorithms were tested on a set of ten million pages, but parallelization techniques make it possible to compute the SimRank scores even for the entire web with over 4 billion pages. The key idea is that randomized Monte Carlo methods combined with indexing techniques yield a scalable approximation of SimRank.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Berkhin, P.. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA (2002)
Brewer, E.: Lessons from giant-scale services
Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems 30(1-7), 107–117 (1998)
Broder, A.: On the resemblance and containment of documents. In: Proceedings of the Compression and Complexity of Sequences 1997, p. 21. IEEE Computer Society Press, Los Alamitos (1997)
Bruno, N., Gravano, L., Marian, A.: Evaluating top-k queries over web-accessible databases. In: Proceedings of the ICDE Conference (2002)
Chen, Y.Y., Gan, Q., Suel, T.: I/O-efficient techniques for computing PageRank. In: Proceedings of the eleventh international conference on Information and knowledge management, pp. 549–557. ACM Press, New York (2002)
Cristo, M., Calado, P., de Moura, E.S., Ziviani, N., Ribeiro-Neto, B.: Link information as a similarity measure in web classification. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds.) SPIRE 2003. LNCS, vol. 2857, pp. 43–55. Springer, Heidelberg (2003)
Dean, J., Henzinger, M.R.: Finding related pages in the World Wide Web. Computer Networks (Amsterdam, The Netherlands: 1999) 31(11-16), 1467–1479 (1999)
Flake, G., Lawrence, S., Giles, C.L., Coetzee, F.: Self-organization of the web and identification of communities. IEEE Computer 35(3), 66–71 (2002)
Haveliwala, T.H., Gionis, A., Klein, D., Indyk, P.: Evaluating strategies for similarity search on the web. In: Proceedings of the 11th World Wide Web Conference (WWW), pp. 432–442. ACM Press, New York (2002)
Heintze, N.: Scalable document fingerprinting. In: 1996 USENIX Workshop on Electronic Commerce (November 1996)
Jeh, G., Widom, J.: SimRank: A measure of structural-context similarity. In: Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York (2002)
Kleinberg, J.: Authoritative sources in a hyperlinked environment. Journal of the ACM 46(5), 604–632 (1999)
Lu, W., Janssen, J., Milios, E., Japkowicz, N.: Node similarity in networked information spaces. In: Proceedings of the 2001 conference of the Centre for Advanced Studies on Collaborative research, p. 11. IBM Press (2001)
Meyer, U., Sanders, P., Sibeyn, J.F.: Algorithms for Memory Hierarchies. LNCS, vol. 2625. Springer, Heidelberg (2003)
Open Directory Project (ODP), http://www.dmoz.org
Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project (1998)
Palmer, C.R., Gibbons, P.B., Faloutsos, C.: ANF: A fast and scalable tool for data mining in massive graphs. In: Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 81–90. ACM Press, New York (2002)
Rosenthal, J.S.: Parallel computing and Monte Carlo algorithms. Far East J. Theor. Stat. 4, 207–236 (2000)
Witten, I.H., Moffat, A., Bell, T.C.: Managing gigabytes: Compressing and indexing documents and images, 2nd edn. Morgan Kaufmann Publishers Inc., San Francisco (1999)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fogaras, D., Rácz, B. (2004). A Scalable Randomized Method to Compute Link-Based Similarity Rank on the Web Graph. In: Lindner, W., Mesiti, M., Türker, C., Tzitzikas, Y., Vakali, A.I. (eds) Current Trends in Database Technology - EDBT 2004 Workshops. EDBT 2004. Lecture Notes in Computer Science, vol 3268. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30192-9_55
Download citation
DOI: https://doi.org/10.1007/978-3-540-30192-9_55
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23305-3
Online ISBN: 978-3-540-30192-9
eBook Packages: Computer ScienceComputer Science (R0)