
Sampling the National Deep Web

  • Conference paper
Database and Expert Systems Applications (DEXA 2011)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6860)

Abstract

A large portion of today’s Web consists of pages populated with information from a myriad of online databases. This part of the Web, known as the deep Web, remains relatively unexplored, and even basic characteristics, such as the number of searchable databases on the Web or the distribution of databases by subject, are still disputed. In this paper, we revisit a central problem of deep Web characterization: how to estimate the total number of online databases on the Web. We propose the Host-IP clustering sampling method, which addresses the drawbacks of existing approaches to deep Web characterization, and report our findings from a survey of the Russian Web. The obtained estimates, together with the proposed sampling technique, could be useful for further studies on handling data in the deep Web.
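As an illustration only: the abstract names the Host-IP clustering sampling method but does not describe its mechanics, so the sketch below should not be read as the paper's procedure. It shows a generic textbook cluster-sampling estimate, under the assumption that hosts sharing an IP address form one cluster. The function names, the toy host list, and the `count_databases` callback are all hypothetical.

```python
import random
from collections import defaultdict


def cluster_hosts_by_ip(host_to_ip):
    """Group hostnames into clusters keyed by the IP address they resolve to."""
    clusters = defaultdict(list)
    for host, ip in host_to_ip.items():
        clusters[ip].append(host)
    return dict(clusters)


def estimate_total_databases(clusters, sample_size, count_databases, rng=None):
    """Estimate the total number of databases from a simple random sample of clusters.

    Uses the standard expansion estimator N * (mean count per sampled cluster),
    where N is the number of clusters. `count_databases` is a hypothetical
    callback returning how many searchable databases a given host exposes;
    in a real survey this would involve crawling and inspecting the host.
    """
    rng = rng or random.Random()
    ips = sorted(clusters)
    if not 0 < sample_size <= len(ips):
        raise ValueError("sample_size must be between 1 and the number of clusters")
    sampled_ips = rng.sample(ips, sample_size)  # sampling without replacement
    per_cluster = [
        sum(count_databases(host) for host in clusters[ip]) for ip in sampled_ips
    ]
    return len(ips) * sum(per_cluster) / sample_size


# Toy data (invented for illustration; not taken from the paper).
TOY_HOSTS = {
    "a.example": "192.0.2.1",
    "b.example": "192.0.2.1",  # shares an IP with a.example, so same cluster
    "c.example": "192.0.2.2",
    "d.example": "192.0.2.3",
}
TOY_DB_COUNTS = {"a.example": 2, "b.example": 0, "c.example": 1, "d.example": 0}

toy_clusters = cluster_hosts_by_ip(TOY_HOSTS)
# Sampling every cluster reduces the estimate to the exact total.
full_census = estimate_total_databases(
    toy_clusters, len(toy_clusters), TOY_DB_COUNTS.get, random.Random(0)
)
print(full_census)
```

Sampling whole IP clusters rather than individual hosts means that hosts served from the same address are inspected together; with a sample smaller than the full set of clusters, the result is an estimate whose variance depends on how unevenly databases are spread across clusters.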

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Shestakov, D. (2011). Sampling the National Deep Web. In: Hameurlain, A., Liddle, S.W., Schewe, KD., Zhou, X. (eds) Database and Expert Systems Applications. DEXA 2011. Lecture Notes in Computer Science, vol 6860. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23088-2_24

  • DOI: https://doi.org/10.1007/978-3-642-23088-2_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23087-5

  • Online ISBN: 978-3-642-23088-2

  • eBook Packages: Computer Science, Computer Science (R0)
