InfoMonitor: unobtrusively archiving a World Wide Web server

  • Regular contribution
International Journal on Digital Libraries

Abstract

It is important to provide long-term preservation of digital data even when those data are stored in an unreliable system such as a filesystem, a legacy database, or even the World Wide Web. In this paper we focus on the problem of archiving the contents of a Web site without disrupting the users who maintain the site. We propose an archival storage system, the InfoMonitor, in which a reliable archive is integrated with an unmodified existing store. Implementing such a system presents various challenges related to the mismatch of features between the components, such as differences in naming and in data manipulation operations. We examine each of these issues as well as solutions for the conflicts that arise. We also discuss our experience using the InfoMonitor to archive the Stanford Database Group’s Web site.

Author information

Corresponding author

Correspondence to Brian F. Cooper.

About this article

Cite this article

Cooper, B., Garcia-Molina, H. InfoMonitor: unobtrusively archiving a World Wide Web server. Int J Digit Libr 5, 106–119 (2005). https://doi.org/10.1007/s00799-003-0052-x

  • DOI: https://doi.org/10.1007/s00799-003-0052-x
