Abstract
The Web has become a huge repository of pages, and search engines allow users to find relevant information in it. Web crawlers are an important component of search engines: they find, download, parse, and store pages in a repository. In this paper, we present a new algorithm for verifying URL uniqueness in a large-scale web crawler. The uniqueness verifier must check whether a URL is present in the repository of unique URLs and whether the corresponding page has already been collected. The algorithm is based on a novel policy for organizing the set of unique URLs according to the server they belong to, exploiting a locality of reference property. This property is inherent in Web traversals and follows from the skewed distribution of links within a web page, which favors references to other pages on the same server. We select the URLs to be crawled taking into account information about the servers they belong to, which allows our algorithm to be used in the crawler without extra cost to pre-organize the entries. We compare our algorithm with a state-of-the-art algorithm from the literature, presenting a model for both algorithms and comparing their performance. Experiments using a crawling simulation over a representative subset of the Web show that the adopted policy yields a significant improvement in the time spent on URL uniqueness verification.
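The server-based organization described in the abstract can be illustrated with a minimal sketch (the class and method names below are hypothetical; the paper's actual data structures and on-disk layout are not reproduced here). The idea is to partition the seen-URL test by server, so that consecutive lookups for URLs from the same server touch the same partition and benefit from locality of reference:

```python
from collections import defaultdict
from urllib.parse import urlsplit


class ServerPartitionedSeenSet:
    """Hypothetical sketch of a seen-URL test partitioned by server.

    Grouping entries by server exploits the locality of reference in
    Web traversals: most links on a page point to pages on the same
    server, so consecutive lookups tend to hit the same partition.
    """

    def __init__(self):
        # One partition (set of URLs) per server name.
        self._by_server = defaultdict(set)

    def check_and_add(self, url):
        """Return True if url is new (recording it), False if already seen."""
        server = urlsplit(url).netloc
        partition = self._by_server[server]
        if url in partition:
            return False
        partition.add(url)
        return True
```

In a disk-based verifier, each partition would be a contiguous region of the URL repository; selecting URLs to crawl by server (as the paper proposes) then means each region is loaded once and probed many times, rather than paging randomly across the whole set.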
© 2011 Springer-Verlag Berlin Heidelberg
Cite this paper
Henrique, W.F., Ziviani, N., Cristo, M.A., de Moura, E.S., da Silva, A.S., Carvalho, C. (2011). A New Approach for Verifying URL Uniqueness in Web Crawlers. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds) String Processing and Information Retrieval. SPIRE 2011. Lecture Notes in Computer Science, vol 7024. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24583-1_23
DOI: https://doi.org/10.1007/978-3-642-24583-1_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24582-4
Online ISBN: 978-3-642-24583-1