Analysing the Effectiveness of Crawlers on the Client-Side Hidden Web

  • Conference paper
Trends in Practical Applications of Agents and Multiagent Systems

Part of the book series: Advances in Intelligent and Soft Computing (AISC, volume 157)

Abstract

The main goal of this study is to present a scale that classifies crawling systems according to their effectiveness in traversing the “client-side” Hidden Web. To that end, we accomplish several tasks. First, we perform a thorough analysis of the different client-side technologies and the main features of Web 2.0 pages in order to determine the initial levels of the aforementioned scale. Second, we present a Web site whose purpose is to check which crawlers are capable of dealing with those technologies and features. Third, we propose several methods to evaluate the performance of the crawlers on the Web site and to classify them according to the levels of the scale. Fourth, we show the results of applying those methods to several open-source and commercial crawlers, as well as to the robots of the main Web search engines.
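The paper's evaluation site is not reproduced here, but the core difficulty it probes can be sketched with a hypothetical test page: a link that exists only after JavaScript execution is invisible to a crawler that merely parses the raw HTML. The page markup, the `extract_links` helper, and the URLs below are illustrative assumptions, not the paper's actual test suite.

```python
import re

# Hypothetical test page: one link is present in the static HTML, the
# other is created only when a browser executes the embedded JavaScript.
# The dynamic URL is built by string concatenation, so it never appears
# as a literal href attribute in the source.
TEST_PAGE = """
<html><body>
  <a href="/static.html">Static link</a>
  <script>
    var a = document.createElement('a');
    a.href = '/dynamic' + '.html';   // exists only after JS execution
    document.body.appendChild(a);
  </script>
</body></html>
"""

def extract_links(html):
    # Naive static extraction: scan the raw HTML for href="..."
    # attributes, as a crawler with no JavaScript engine would.
    return re.findall(r'href="([^"]+)"', html)

print(extract_links(TEST_PAGE))  # ['/static.html'] -- /dynamic.html stays hidden
```

A crawler that scores higher on the proposed scale would need a JavaScript engine (or a headless browser) to discover `/dynamic.html`; the scale's levels correspond to progressively harder client-side constructs of this kind.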




Author information

Correspondence to Víctor M. Prieto.


Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Prieto, V.M., Álvarez, M., López-García, R., Cacheda, F. (2012). Analysing the Effectiveness of Crawlers on the Client-Side Hidden Web. In: Rodríguez, J., Pérez, J., Golinska, P., Giroux, S., Corchuelo, R. (eds) Trends in Practical Applications of Agents and Multiagent Systems. Advances in Intelligent and Soft Computing, vol 157. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28795-4_17

  • DOI: https://doi.org/10.1007/978-3-642-28795-4_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28794-7

  • Online ISBN: 978-3-642-28795-4
