Improving the Quality of Web Archives through the Importance of Changes

  • Conference paper
Database and Expert Systems Applications (DEXA 2011)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6860)

Abstract

Due to the growing importance of the Web, several archiving institutions (national libraries, the Internet Archive, etc.) harvest sites to preserve (a part of) the Web for future generations. A major issue for archivists is preserving the quality of web archives. One way to assess the quality of an archive is to quantify its completeness and the coherence of its page versions. Because of the large number of pages to capture and the limited resources (storage space, bandwidth, etc.), it is impossible to build a complete archive, that is, one containing all the versions of all the pages. It is also impossible to ensure the coherence of all captured versions, because pages change very frequently during the crawl of a site. Nonetheless, the quality of archives can be maximized by adjusting the web crawler's strategy. Our idea is (i) to improve the completeness of the archive by downloading the most important versions and (ii) to keep the most important versions as coherent as possible. Moreover, we introduce a pattern model that describes how the importance of page changes behaves over time. Based on these patterns, we propose a crawl strategy that improves both the completeness and the coherence of web archives. Experiments based on real patterns show the usefulness and effectiveness of our approach.
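
As a rough illustration of idea (i), the sketch below picks the most important page versions under a fixed download budget and reports how much of the total change importance the archive then covers. It is not the paper's algorithm or its quality measures: the `Version` record, the `importance` score in [0, 1], the `budget` parameter, and the importance-weighted completeness ratio are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """One version of a page (hypothetical record, not from the paper)."""
    page: str
    time: int
    importance: float  # assumed importance of the change that produced it


def select_versions(versions, budget):
    """Keep the `budget` most important versions (greedy, assumed strategy)."""
    ranked = sorted(versions, key=lambda v: v.importance, reverse=True)
    return ranked[:budget]


def weighted_completeness(captured, all_versions):
    """Share of total importance captured (assumed measure, for illustration)."""
    total = sum(v.importance for v in all_versions)
    if total == 0:
        return 1.0  # nothing important was produced, so nothing was missed
    return sum(v.importance for v in captured) / total


versions = [
    Version("home", 1, 0.9),
    Version("home", 2, 0.1),
    Version("news", 1, 0.5),
    Version("news", 2, 0.5),
]
captured = select_versions(versions, budget=2)
completeness = weighted_completeness(captured, versions)
print(f"captured {len(captured)} of {len(versions)} versions, "
      f"weighted completeness = {completeness:.2f}")
```

With this toy data, half of the versions are downloaded but well over half of the total importance is covered, which is the intuition behind ranking versions by importance rather than capturing them uniformly. Coherence, idea (ii), is not modeled here.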

This research is supported by the French National Research Agency ANR in the CARTEC Project (ANR-07-MDCO-016).

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Saad, M.B., Gançarski, S. (2011). Improving the Quality of Web Archives through the Importance of Changes. In: Hameurlain, A., Liddle, S.W., Schewe, K.-D., Zhou, X. (eds) Database and Expert Systems Applications. DEXA 2011. Lecture Notes in Computer Science, vol 6860. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23088-2_29

  • DOI: https://doi.org/10.1007/978-3-642-23088-2_29

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23087-5

  • Online ISBN: 978-3-642-23088-2

  • eBook Packages: Computer Science, Computer Science (R0)
