News Page Discovery Policy for Instant Crawlers

  • Conference paper
Information Retrieval Technology (AIRS 2008)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4993)

Abstract

Many news pages with high freshness requirements are published on the Internet every day. Instant crawlers should download them immediately; otherwise, they soon become outdated. In the past, instant crawlers downloaded pages only from a manually generated list of news websites. Because news websites do not publish news pages exclusively, bandwidth is wasted on downloading non-news pages. This paper proposes a novel approach to discovering news pages, consisting of seed selection and news URL prediction based on user behavior analysis. Empirical studies on a two-month user access log show that our approach outperforms the traditional approach in both precision and recall.
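
For readers unfamiliar with the two evaluation measures named in the abstract, the following is a minimal sketch of how precision and recall could be computed for a set of discovered URLs. It uses the standard definitions only; the function name `precision_recall` and the example URLs are hypothetical and are not taken from the paper, whose actual evaluation procedure is not described on this page.

```python
def precision_recall(discovered, relevant):
    """Return (precision, recall) for discovered URLs against known news URLs.

    precision = correctly discovered news URLs / all discovered URLs
    recall    = correctly discovered news URLs / all known news URLs
    """
    discovered = set(discovered)
    relevant = set(relevant)
    hits = len(discovered & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(discovered) if discovered else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: three URLs were downloaded, two of them are news pages,
# and one known news page was missed.
discovered = {
    "http://example.com/news/1",
    "http://example.com/news/2",
    "http://example.com/about",
}
relevant = {
    "http://example.com/news/1",
    "http://example.com/news/2",
    "http://example.com/news/3",
}
p, r = precision_recall(discovered, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case, downloading a non-news page lowers precision (wasted bandwidth), while missing a news page lowers recall.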

Editor information

Hang Li, Ting Liu, Wei-Ying Ma, Tetsuya Sakai, Kam-Fai Wong, Guodong Zhou

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, Y., Liu, Y., Zhang, M., Ma, S. (2008). News Page Discovery Policy for Instant Crawlers. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_58

  • DOI: https://doi.org/10.1007/978-3-540-68636-1_58

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68633-0

  • Online ISBN: 978-3-540-68636-1

  • eBook Packages: Computer Science, Computer Science (R0)
