Abstract
Real-time search has emerged because a significant amount of time-sensitive information is produced online every minute. Unlike most commercial web sites, which publish content on routine schedules, online users post to web communities with high variance in both timing and quality. In this work, we address the scheduling problem for web crawlers, with the objective of optimizing the quality of the local index (i.e. minimizing the total weighted delay of postings) under a given quantity of resources. To this end, we combine a posting importance evaluation mechanism with the underlying publish pattern of the data source to build a model that predicts the generation of posting weights; the crawler uses this model to choose its retrieval points for better index quality. Through extensive experiments on several web communities, we show that our policy outperforms both uniform scheduling and a policy based purely on the posting generation pattern.
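To make the objective in the abstract concrete, the following is a minimal sketch of how the total weighted delay of a crawl schedule could be computed. It is not the authors' implementation: the function name `total_weighted_delay`, the sample postings, and the two schedules are hypothetical, and the sketch assumes that a posting's delay is the time from its publication to the first crawl at or after it.

```python
from bisect import bisect_left

def total_weighted_delay(postings, crawl_times):
    """Sum of weight * delay over all postings (hypothetical helper).

    postings    -- list of (publish_time, weight) pairs
    crawl_times -- sorted list of retrieval points
    Assumption: a posting is indexed by the first crawl at or after its
    publish time; postings published after the last crawl are ignored.
    """
    total = 0.0
    for publish_time, weight in postings:
        i = bisect_left(crawl_times, publish_time)
        if i < len(crawl_times):
            total += weight * (crawl_times[i] - publish_time)
    return total

# Illustrative data only: three important postings appear early, one minor one late.
postings = [(1.0, 5.0), (1.5, 3.0), (2.0, 4.0), (9.0, 1.0)]

uniform = [5.0, 10.0]    # two crawls spaced evenly
informed = [2.0, 10.0]   # same budget, first crawl moved next to the burst

print(total_weighted_delay(postings, uniform))   # 43.5
print(total_weighted_delay(postings, informed))  # 7.5
```

With the same budget of two crawls, moving one retrieval point close to the burst of high-weight postings lowers the total weighted delay, which is the kind of gain a schedule informed by predicted posting weights aims for.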
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sun, J., Gao, H., Yang, X. (2010). Towards a Quality-Oriented Real-Time Web Crawler. In: Wang, F.L., Gong, Z., Luo, X., Lei, J. (eds) Web Information Systems and Mining. WISM 2010. Lecture Notes in Computer Science, vol 6318. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16515-3_10
DOI: https://doi.org/10.1007/978-3-642-16515-3_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16514-6
Online ISBN: 978-3-642-16515-3
eBook Packages: Computer Science (R0)