A Scalable Randomized Method to Compute Link-Based Similarity Rank on the Web Graph

  • Conference paper
Current Trends in Database Technology - EDBT 2004 Workshops (EDBT 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3268)

Abstract

Several iterative hyperlink-based similarity measures have been published to express the similarity of web pages. However, evaluating complex similarity functions over large repositories containing hundreds of millions of pages usually seems hopeless. We introduce scalable algorithms that compute SimRank scores, which express the contextual similarity of pages based on the hyperlink structure. The proposed methods scale well to large repositories and meet strict requirements on computational complexity. The algorithms were tested on a set of ten million pages, but parallelization techniques make it possible to compute SimRank scores even for the entire web with over 4 billion pages. The key idea is that randomized Monte Carlo methods combined with indexing techniques yield a scalable approximation of SimRank.
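To make the Monte Carlo idea concrete, here is a minimal sketch. It is not the authors' algorithm, and it does not reproduce the indexing or parallelization techniques the abstract mentions. It relies on the random-surfer-pair interpretation of SimRank, in which the score of two pages equals the expected value of decay^tau, where tau is the first step at which two random walks following links backwards from the two pages meet. The function name `estimate_simrank`, the `in_links` adjacency dictionary, and the parameters `decay`, `walk_length`, and `num_samples` are all assumptions made for illustration. Walks are truncated after `walk_length` steps, so the result approximates a truncated SimRank score.

```python
import random


def estimate_simrank(in_links, u, v, decay=0.6, walk_length=10,
                     num_samples=1000, seed=0):
    """Monte Carlo estimate of a truncated SimRank score for pages u and v.

    in_links maps each page to the list of pages linking to it
    (hypothetical input format, chosen for this sketch).
    """
    if u == v:
        return 1.0  # a page is maximally similar to itself

    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        a, b = u, v
        for step in range(1, walk_length + 1):
            preds_a = in_links.get(a, [])
            preds_b = in_links.get(b, [])
            if not preds_a or not preds_b:
                break  # a walk is stuck, so this sample contributes 0
            # Each walk independently steps to a uniformly random in-neighbor.
            a = rng.choice(preds_a)
            b = rng.choice(preds_b)
            if a == b:
                total += decay ** step  # walks first meet after `step` steps
                break
    return total / num_samples


# Toy graph: pages "a" and "b" are both linked only from page "w".
toy_in_links = {"a": ["w"], "b": ["w"], "w": []}
score = estimate_simrank(toy_in_links, "a", "b")
```

In the toy graph both walks must step to `w` immediately, so every sample meets after one step and the estimate equals `decay`. On a real graph the estimate is noisy, and its variance shrinks as `num_samples` grows. Because each sample is independent of the others, the samples can be computed in parallel.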


Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fogaras, D., Rácz, B. (2004). A Scalable Randomized Method to Compute Link-Based Similarity Rank on the Web Graph. In: Lindner, W., Mesiti, M., Türker, C., Tzitzikas, Y., Vakali, A.I. (eds) Current Trends in Database Technology - EDBT 2004 Workshops. EDBT 2004. Lecture Notes in Computer Science, vol 3268. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30192-9_55

  • DOI: https://doi.org/10.1007/978-3-540-30192-9_55

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23305-3

  • Online ISBN: 978-3-540-30192-9

  • eBook Packages: Computer Science, Computer Science (R0)
