Towards a Quality-Oriented Real-Time Web Crawler

  • Conference paper
Web Information Systems and Mining (WISM 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6318)

Abstract

Real-time search has emerged because a significant amount of time-sensitive information is produced online every minute. Unlike most commercial web sites, which publish content on routine schedules, users of web communities post with high variance in both timing and quality. In this work, we address the scheduling problem for web crawlers, with the objective of optimizing the quality of the local index (i.e., minimizing the total weighted delay of postings) under a given quantity of resources. To this end, we use a posting importance evaluation mechanism and the underlying publishing pattern of the data source to build a model that predicts the generation of posting weights, which helps the web crawler decide its retrieval points for better index quality. Through extensive experiments on several web communities, we show that our policy outperforms both uniform scheduling and a policy based purely on the posting generation pattern.
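The objective named in the abstract, the total weighted delay of postings, can be illustrated with a minimal sketch. This is not the authors' model or scheduling policy; the function name `total_weighted_delay`, the `(publish_time, weight)` representation, and the sample numbers are all assumptions made for illustration. The sketch assumes a posting's delay is the gap between its publish time and the first crawl at or after that time.

```python
from bisect import bisect_left


def total_weighted_delay(postings, crawl_times):
    """Sum of weight * delay over all postings (hypothetical helper).

    postings: list of (publish_time, weight) pairs.
    crawl_times: times at which the crawler retrieves the source.
    A posting is picked up by the first crawl at or after its publish
    time; postings published after the last crawl are not counted.
    """
    crawls = sorted(crawl_times)
    total = 0.0
    for publish_time, weight in postings:
        # Index of the first crawl that is not earlier than the posting.
        i = bisect_left(crawls, publish_time)
        if i < len(crawls):
            total += weight * (crawls[i] - publish_time)
    return total


# Illustrative data: one important early posting, two minor ones.
postings = [(1, 5.0), (2, 1.0), (9, 1.0)]

# Two crawls in both schedules, i.e. the same resource budget.
uniform_schedule = [5, 10]
weight_aware_schedule = [2, 10]

uniform_cost = total_weighted_delay(postings, uniform_schedule)
weight_aware_cost = total_weighted_delay(postings, weight_aware_schedule)
```

With the same number of crawls, the schedule that places a retrieval point just after the heavily weighted posting yields a lower total weighted delay on this toy data, which is the kind of trade-off the scheduling problem concerns.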

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sun, J., Gao, H., Yang, X. (2010). Towards a Quality-Oriented Real-Time Web Crawler. In: Wang, F.L., Gong, Z., Luo, X., Lei, J. (eds) Web Information Systems and Mining. WISM 2010. Lecture Notes in Computer Science, vol 6318. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16515-3_10

  • DOI: https://doi.org/10.1007/978-3-642-16515-3_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-16514-6

  • Online ISBN: 978-3-642-16515-3

  • eBook Packages: Computer Science, Computer Science (R0)
