Analysing the Effectiveness of Crawlers on the Client-Side Hidden Web

  • Conference paper
Trends in Practical Applications of Agents and Multiagent Systems

Part of the book series: Advances in Intelligent and Soft Computing (AISC, volume 157)

Abstract

The main goal of this study is to present a scale that classifies crawling systems according to their effectiveness in traversing the “client-side” Hidden Web. To that end, we accomplish several tasks. First, we perform a thorough analysis of the different client-side technologies and the main features of Web 2.0 pages in order to determine the initial levels of the aforementioned scale. Second, we present a Web site whose purpose is to check which crawlers are capable of dealing with those technologies and features. Third, we propose several methods to evaluate the performance of the crawlers on the Web site and to classify them according to the levels of the scale. Fourth, we show the results of applying those methods to several open-source and commercial crawlers, as well as to the robots of the main Web search engines.
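The paper's evaluation site is not reproduced here, but the core difficulty it probes can be sketched with a hypothetical test page: a link that exists only after JavaScript execution is invisible to a crawler that merely parses the raw HTML. The page markup, the `extract_links` helper, and the URLs below are illustrative assumptions, not the paper's actual test suite.

```python
import re

# Hypothetical test page: one link is present in the static HTML, the
# other is created only when a browser executes the embedded JavaScript.
# The dynamic URL is built by string concatenation, so it never appears
# as a literal href attribute in the source.
TEST_PAGE = """
<html><body>
  <a href="/static.html">Static link</a>
  <script>
    var a = document.createElement('a');
    a.href = '/dynamic' + '.html';   // exists only after JS execution
    document.body.appendChild(a);
  </script>
</body></html>
"""

def extract_links(html):
    # Naive static extraction: scan the raw HTML for href="..."
    # attributes, as a crawler with no JavaScript engine would.
    return re.findall(r'href="([^"]+)"', html)

print(extract_links(TEST_PAGE))  # ['/static.html'] -- /dynamic.html stays hidden
```

A crawler that scores higher on the proposed scale would need a JavaScript engine (or a headless browser) to discover `/dynamic.html`; the scale's levels correspond to progressively harder client-side constructs of this kind.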




Author information

Correspondence to Víctor M. Prieto.


Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Prieto, V.M., Álvarez, M., López-García, R., Cacheda, F. (2012). Analysing the Effectiveness of Crawlers on the Client-Side Hidden Web. In: Rodríguez, J., Pérez, J., Golinska, P., Giroux, S., Corchuelo, R. (eds) Trends in Practical Applications of Agents and Multiagent Systems. Advances in Intelligent and Soft Computing, vol 157. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28795-4_17

  • DOI: https://doi.org/10.1007/978-3-642-28795-4_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28794-7

  • Online ISBN: 978-3-642-28795-4
