Nautilus: A Generic Framework for Crawling Deep Web

  • Conference paper
Data and Knowledge Engineering (ICDKE 2012)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7696)

Abstract

This paper presents Nautilus, a generic framework for crawling the deep Web. We provide an abstraction of the deep-Web crawling process and a mechanism for integrating heterogeneous business modules. A Federal Decentralized Architecture is proposed to combine the advantages of existing P2P networking architectures. We also present effective policies for scheduling crawling tasks. Experimental results show that our scheduling policies achieve good load balance and high overall throughput.
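The paper's actual scheduling policies are behind the paywall, so the following is a hypothetical sketch only: it illustrates one common way to schedule crawl tasks across nodes in a decentralized P2P-style crawler, using consistent hashing so that tasks spread evenly and adding or removing a node remaps only a small fraction of them. The `ConsistentHashScheduler` class and all names are illustrative, not taken from Nautilus.

```python
import hashlib
from bisect import bisect_right

class ConsistentHashScheduler:
    """Assign crawl tasks (e.g. site URLs) to crawler nodes on a hash ring.

    Each node is mapped to several virtual points ("vnodes") on the ring,
    which smooths out the load across nodes. A task goes to the first node
    clockwise from the task's own hash position.
    """

    def __init__(self, nodes, vnodes=64):
        self.ring = []  # sorted list of (hash_value, node)
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]  # hash values only, for bisect

    @staticmethod
    def _hash(key):
        # MD5 used purely for its uniform spread, not for security.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def assign(self, task):
        """Return the node responsible for this crawl task."""
        idx = bisect_right(self.keys, self._hash(task)) % len(self.ring)
        return self.ring[idx][1]

# Example: distribute 1000 crawl tasks across 4 crawler nodes.
sched = ConsistentHashScheduler([f"node{i}" for i in range(4)])
counts = {}
for n in range(1000):
    node = sched.assign(f"http://site-{n}.example/search")
    counts[node] = counts.get(node, 0) + 1
print(counts)
```

With 64 virtual points per node the per-node task counts come out roughly equal, and the assignment is deterministic, so any node can recompute which peer owns a given task without central coordination.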





Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhao, J., Wang, P. (2012). Nautilus: A Generic Framework for Crawling Deep Web. In: Xiang, Y., Pathan, M., Tao, X., Wang, H. (eds) Data and Knowledge Engineering. ICDKE 2012. Lecture Notes in Computer Science, vol 7696. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34679-8_14

  • DOI: https://doi.org/10.1007/978-3-642-34679-8_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-34678-1

  • Online ISBN: 978-3-642-34679-8

  • eBook Packages: Computer Science, Computer Science (R0)
