A Scalable Randomized Method to Compute Link-Based Similarity Rank on the Web Graph

Fogaras, Dániel; Rácz, Balázs

doi:10.1007/978-3-540-30192-9_55

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3268)

Included in the following conference series:

International Conference on Extending Database Technology

Abstract

Several iterative hyperlink-based similarity measures have been published to express the similarity of web pages. However, evaluating complex similarity functions over large repositories containing hundreds of millions of pages usually seems hopeless. We introduce scalable algorithms for computing SimRank scores, which express the contextual similarity of pages based on the hyperlink structure. The proposed methods scale well to large repositories and fulfill strict requirements on computational complexity. The algorithms were tested on a set of ten million pages, but parallelization techniques make it possible to compute SimRank scores even for the entire web with over 4 billion pages. The key idea is that randomized Monte Carlo methods combined with indexing techniques yield a scalable approximation of SimRank.
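To make the Monte Carlo idea concrete, here is a minimal sketch that is not taken from the chapter. It relies on the random-surfer-pair characterization of SimRank given by Jeh and Widom: the score of two pages equals the expected value of c^τ, where τ is the first step at which two random walks, following links backwards from the two pages, meet. The function name `estimate_simrank`, the `in_links` dictionary, and the parameters `c`, `num_walks`, and `max_steps` are all assumptions made for illustration. The sketch uses independent walks and omits the indexing and parallelization that the abstract describes.

```python
import random


def estimate_simrank(in_links, u, v, c=0.6, num_walks=1000, max_steps=10, seed=0):
    """Monte Carlo estimate of SimRank(u, v) from paired reverse random walks.

    in_links maps each page to the list of pages that link to it
    (a hypothetical in-memory representation, chosen for illustration).
    Walks are truncated after max_steps, so the estimate is slightly
    biased downward relative to the exact score.
    """
    if u == v:
        return 1.0  # a page is maximally similar to itself by definition
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_walks):
        a, b = u, v
        for step in range(1, max_steps + 1):
            preds_a = in_links.get(a, [])
            preds_b = in_links.get(b, [])
            if not preds_a or not preds_b:
                break  # one walk is stuck, so this pair can never meet
            # Each walk independently steps backwards along a random in-link.
            a = rng.choice(preds_a)
            b = rng.choice(preds_b)
            if a == b:
                total += c ** step  # first meeting at this step contributes c^step
                break
    return total / num_walks


# Tiny example: pages "a" and "b" are both linked from "w" and "x".
graph = {"a": ["w", "x"], "b": ["w", "x"]}
score = estimate_simrank(graph, "a", "b", num_walks=20000)
```

In this toy graph the two walks meet after one step with probability 1/2 and otherwise get stuck, so the exact score is c/2 and the estimate should land close to that value. Each simulated walk pair is independent of the others, which is what makes this style of estimator straightforward to split across machines.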

Author information

Authors and Affiliations

Computer and Automation Research Institute of the Hungarian Academy of Sciences,
Dániel Fogaras & Balázs Rácz
Budapest University of Technology and Economics,
Dániel Fogaras & Balázs Rácz

Editor information

Editors and Affiliations

Sidonia Systems, Grubmühl 20, D-82131, Stockdorf, Germany
Wolfgang Lindner
Università di Milano, Italy
Marco Mesiti
Functional Genomics Center Zurich (FGCZ), UZH / ETH Zurich, Winterthurerstrasse 190, CH–8057, Zurich, Switzerland
Can Türker
Computer Science Department, University of Crete, and Institute of Computer Science, FORTH-ICS, Greece
Yannis Tzitzikas
Aristotle University of Thessaloniki,
Athena I. Vakali

About this paper

Cite this paper

Fogaras, D., Rácz, B. (2004). A Scalable Randomized Method to Compute Link-Based Similarity Rank on the Web Graph. In: Lindner, W., Mesiti, M., Türker, C., Tzitzikas, Y., Vakali, A.I. (eds) Current Trends in Database Technology - EDBT 2004 Workshops. EDBT 2004. Lecture Notes in Computer Science, vol 3268. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30192-9_55

DOI: https://doi.org/10.1007/978-3-540-30192-9_55
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23305-3
Online ISBN: 978-3-540-30192-9
eBook Packages: Computer Science, Computer Science (R0)
