Following the dynamic block on the Web

World Wide Web

Abstract

Dynamic web pages change rapidly, and users increasingly need instant updates for the dynamic blocks they care about. In this paper, we address the problem of automatically following dynamic blocks in web pages: given a user-specified block on a web page, we continuously track the content of the block and report updates in real time. Such a service lets users track, for example, the top-ten breaking news on CNN, the prices of iPhones on Amazon, or NBA game scores. We study 3,346 human-labeled blocks from 1,127 pages and analyze the effectiveness of four types of patterns for tracking content blocks: visual area, DOM tree path, inner content, and close context. Because web pages change frequently, we find that the initial patterns generated on the original page can become invalid over time, causing extraction of the correct block to fail. Based on these observations, we combine different patterns to improve the accuracy and stability of block extraction. Moreover, we propose an adaptive model that adapts each pattern individually and adjusts the pattern weights for an improved combination. Experimental results show that the proposed models outperform existing approaches, with the adaptive model performing best.
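The idea of combining several patterns with adjustable weights can be illustrated with a small sketch. This is not the model from the paper, whose details are not given in the abstract. It assumes each of the four pattern types yields a match score in [0, 1] for every candidate block, and it uses a simple moving-average weight update as a stand-in for the adaptive step. The names `BlockTracker`, `pick_block`, `adapt`, and `learning_rate` are hypothetical.

```python
from dataclasses import dataclass, field

# The four pattern types named in the abstract.
PATTERNS = ("visual_area", "dom_path", "inner_content", "close_context")


def _uniform_weights():
    return {p: 1.0 / len(PATTERNS) for p in PATTERNS}


@dataclass
class BlockTracker:
    """Hypothetical weighted combination of per-pattern match scores."""
    weights: dict = field(default_factory=_uniform_weights)
    learning_rate: float = 0.5  # assumed value, not taken from the paper

    def combined_score(self, scores):
        # Weighted sum of the per-pattern scores; a missing pattern counts as 0.
        return sum(self.weights[p] * scores.get(p, 0.0) for p in PATTERNS)

    def pick_block(self, candidates):
        # candidates maps a block id to its {pattern: score} dictionary.
        return max(candidates, key=lambda b: self.combined_score(candidates[b]))

    def adapt(self, chosen_scores):
        # Assumed update rule: move each weight toward how well that pattern
        # matched the chosen block, then renormalize so the weights sum to 1.
        lr = self.learning_rate
        for p in PATTERNS:
            self.weights[p] = (1 - lr) * self.weights[p] + lr * chosen_scores.get(p, 0.0)
        total = sum(self.weights.values())
        if total > 0:
            self.weights = {p: w / total for p, w in self.weights.items()}
        else:
            self.weights = _uniform_weights()


# Usage: the DOM path still points at an old position (now an ad),
# while the other three patterns agree on the news block.
candidates = {
    "news": {"visual_area": 0.9, "dom_path": 0.1, "inner_content": 0.8, "close_context": 0.9},
    "ad": {"visual_area": 0.2, "dom_path": 1.0, "inner_content": 0.1, "close_context": 0.1},
}
tracker = BlockTracker()
chosen = tracker.pick_block(candidates)
tracker.adapt(candidates[chosen])
```

In this toy case the combination selects the news block even though the DOM path pattern alone would pick the ad, and the update lowers the weight of the pattern that disagreed.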

Acknowledgments

This work was partially supported by the National Basic Research Program of China (973 Program) [No. 2014CB340403], the Fundamental Research Funds for the Central Universities, the Research Funds of Renmin University of China [No. 14XNLF05 and No. 15XNLF03], the National Culture Science and Technology Promotion Plan, the National Natural Science Foundation of China [No. 61502501], and the secondary network prototype system development project by Xinhua News Agency.

Author information

Corresponding author

Correspondence to Sha Hu.

About this article

Cite this article

Hu, S., Wen, JR., Dou, Z. et al. Following the dynamic block on the Web. World Wide Web 19, 1077–1101 (2016). https://doi.org/10.1007/s11280-015-0374-9

  • DOI: https://doi.org/10.1007/s11280-015-0374-9
