ABSTRACT
Loading large collections of hyperlinked resources into a relational database is complicated by inter-node references when these references cannot be indexed. We show that this scenario arises in many real-life collections of hyperlinked resources and propose several solutions to the problem. We run experiments over a graph of the Web with 178 million nodes and around 1 billion edges and report the results.
Index Terms
- Bulk loading large collections of hyperlinked resources