Bulk loading large collections of hyperlinked resources

Author:
Davood Rafiei, University of Alberta

HYPERTEXT '05: Proceedings of the sixteenth ACM conference on Hypertext and hypermedia, September 2005, Pages 267–269
https://doi.org/10.1145/1083356.1083413

Published: 06 September 2005

ABSTRACT

The problem of loading large collections of hyperlinked resources into a relational database is complicated by inter-node references when these references cannot be indexed. We show that this scenario can arise in many real-life hyperlinked resources and propose several solutions to address the problem. We run experiments over a graph of the Web with 178 million nodes and around 1 billion edges and report our results.
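
To make the difficulty concrete, the following is a minimal sketch of one generic way to resolve inter-node references without per-edge index lookups: sort both the node list and the edge list on a fixed-size fingerprint of the URL, then merge them in a single pass. This is an illustration only and is not claimed to be one of the solutions the paper proposes; the abstract does not describe them. The names `fingerprint` and `resolve_edges`, the use of a truncated SHA-1 digest, and the decision to ignore fingerprint collisions are all assumptions made for the example.

```python
import hashlib


def fingerprint(url: str) -> int:
    """Return a 64-bit key from the first 8 bytes of SHA-1 (illustrative choice)."""
    return int.from_bytes(hashlib.sha1(url.encode("utf-8")).digest()[:8], "big")


def resolve_edges(nodes, edges):
    """Replace target URLs in edges with node ids using a sort-merge pass.

    nodes: iterable of (node_id, url)
    edges: iterable of (src_id, dst_url)
    Returns a sorted list of (src_id, dst_id). Edges whose target URL is not
    among the nodes are dropped. Fingerprint collisions are ignored here,
    which is an assumption of this sketch.
    """
    # Sort both inputs on the fingerprint so they can be merged sequentially.
    node_keys = sorted((fingerprint(url), node_id) for node_id, url in nodes)
    edge_keys = sorted((fingerprint(dst_url), src_id) for src_id, dst_url in edges)

    resolved = []
    i = 0
    for fp, src_id in edge_keys:
        # Advance the node cursor until it reaches or passes this fingerprint.
        while i < len(node_keys) and node_keys[i][0] < fp:
            i += 1
        if i < len(node_keys) and node_keys[i][0] == fp:
            resolved.append((src_id, node_keys[i][1]))
    return sorted(resolved)


nodes = [
    (1, "http://a.example/"),
    (2, "http://b.example/"),
    (3, "http://c.example/"),
]
edges = [
    (1, "http://b.example/"),
    (2, "http://c.example/"),
    (3, "http://b.example/"),
    (1, "http://missing.example/"),  # target was never loaded as a node
]
result = resolve_edges(nodes, edges)
```

In this toy input the edge pointing at `http://missing.example/` is discarded because no node carries that URL. At the scale mentioned in the abstract, the in-memory sorts would have to be replaced by external sorting, which this sketch does not attempt.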

Index Terms

Bulk loading large collections of hyperlinked resources
1. Information systems
  1. Information retrieval
  2. Information storage systems

Published in
HYPERTEXT '05: Proceedings of the sixteenth ACM conference on Hypertext and hypermedia
September 2005
310 pages
ISBN: 1595931686
DOI: 10.1145/1083356
General Chair: Siegfried Reich, Salzburg Research, Austria
Program Chair: Manolis Tzagarakis, Computer Technology Institute, Greece
Copyright © 2005 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
bulk loading
large graphs
web graph