ABSTRACT
Archivists and librarians face an ever-increasing amount of digital material. Their task is to preserve its authentic content. In the long run, this requires periodic migrations (from one format to another or from one hardware/software platform to another). Document migrations are challenging tasks for which tool support and a high degree of automation are important. A central aspect is that documents are often mutually related and, hence, a document's semantics has to be considered in its whole context. References between documents are usually formulated in graph- or tree-based query languages such as URLs or XPath. A typical scenario is web archiving, where websites are stored inside a server infrastructure that can be queried from HTML files using URLs. Migrating websites will often require link adaptation in order to preserve link consistency. Although automated and "trustworthy" preservation of link consistency is easy to postulate, it is hard to carry out, in particular if "trustworthy" means "provably working correctly". In this paper, we propose a general approach to semantically evaluating and constructing graph queries that conform to a regular grammar, appear as part of a document's content, and access a graph structure that is specified using First-Order Predicate Logic (FOPL). To do so, we adapt model checking techniques by constructing suitable query automata. We integrate these techniques into our preservation framework [12] and show the feasibility of this approach using an example: we migrate a website to a specific archiving format and demonstrate the automated preservation of link consistency. The approach shown in this paper mainly contributes to a higher degree of automation in document migration while still maintaining a high degree of "trustworthiness", namely "provable correctness".
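The notion of link consistency described above can be illustrated with a small sketch. It is not the query-automaton construction proposed in the paper: it uses plain path walking to evaluate a link and breadth-first search to construct a new one, and it assumes that the migration is given as an explicit node mapping. The site graphs, the node names, and the functions `evaluate`, `construct`, and `migrate_link` are all hypothetical and serve only as an illustration.

```python
from collections import deque

# Hypothetical site graphs: node -> {edge label -> child node}.
# A link is modelled as a sequence of edge labels (the path segments of a URL).
old_site = {
    "root": {"docs": "d"},
    "d": {"index.html": "page", "logo.gif": "img"},
}
new_site = {
    "archive": {"content": "c"},
    "c": {"index.xhtml": "page2", "logo.png": "img2"},
}
# Assumed migration mapping from old nodes to their migrated counterparts.
mapping = {"root": "archive", "d": "c", "page": "page2", "img": "img2"}


def evaluate(graph, start, path):
    """Follow the labels in `path` from `start`; return the node reached, or None."""
    node = start
    for label in path:
        node = graph.get(node, {}).get(label)
        if node is None:
            return None
    return node


def construct(graph, start, target):
    """Breadth-first search for a shortest label sequence from `start` to `target`."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for label, child in sorted(graph.get(node, {}).items()):
            if child not in seen:
                seen.add(child)
                queue.append((child, path + [label]))
    return None


def migrate_link(old_graph, new_graph, node_map, start, path):
    """Rebuild a link so that it points at the migrated counterpart of its old target."""
    old_target = evaluate(old_graph, start, path)
    if old_target is None:
        return None  # the link was already broken before migration
    return construct(new_graph, node_map[start], node_map[old_target])


old_link = ["docs", "logo.gif"]
new_link = migrate_link(old_site, new_site, mapping, "root", old_link)
```

In this sketch, a link counts as consistent after migration when evaluating `new_link` in `new_site` yields the image, under `mapping`, of the node that `old_link` reached in `old_site`. The paper's approach differs in that the graph structure is specified in FOPL and the queries are constrained by a regular grammar, which a plain search like this one does not capture.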
- Cascading style sheets, level 2 revision 1, CSS 2.1 specification. Technical report, W3C, 2006. http://www.w3.org/TR/CSS21/.
- U. M. Borghoff, P. Rödig, J. Scheffczyk, and L. Schmitz. Long-Term Preservation of Digital Documents: Principles and Practices. Springer Verlag, Berlin, Heidelberg, New York, 2006. ISBN: 978-3-540-33639-6.
- B. Chidlovskii. Using regular tree automata as XML schemas. In Proc. Int. IEEE Conf. on Advances in Dig. Lib (ADL2000), pages 89--98, Washington, DC, USA, May 2000.
- Consultative Committee for Space Data Systems CCSDS. Reference model for an open archival information system (OAIS). Technical report, Space Data Systems, 2002.
- J. E. Hopcroft, R. Motwani, and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison Wesley, 3rd edition, July 2006. ISBN: 978-0321455369.
- J. Le Maitre. Describing multistructured XML documents by means of delay nodes. In Proc. of the ACM Symp. on Doc. Eng. (DocEng 2006), pages 155--164, Amsterdam, The Netherlands, October 2006.
- C. Nentwich, L. Capra, W. Emmerich, and A. Finkelstein. xlinkit: a consistency checking and smart link generation service. ACM Trans. Inter. Tech., 2(2):151--185, 2002.
- Network Working Group. Uniform resource identifiers (URI): Generic syntax, 1998. http://www.ietf.org/rfc/rfc2396.txt.
- F. Neven. Automata, logic, and XML. In Proc. of the 16th Int. Workshop CSL 2002, 11th Ann. Conf. of the EACSL, volume 2471, pages 2--26, Edinburgh, UK, Sep. 2002. Springer LNCS.
- J. Scheffczyk, U. M. Borghoff, P. Rödig, and L. Schmitz. Managing inconsistent repositories via prioritized repair actions. In Proceedings of the ACM Symposium on Document Engineering (DocEng 2004), pages 137--146. ACM, October 2004.
- T. Triebsees and U. M. Borghoff. Preservation-centric and constraint-based migration of digital documents. In Proc. of the ACM Symp. on Doc. Eng. (DocEng 2006), pages 59--62, Amsterdam, The Netherlands, October 2006.
- T. Triebsees and U. M. Borghoff. A theory for model-based transformation applied to computer-supported preservation in digital archives. In Proc. 14th Ann. IEEE Int. Conf. on the Eng. of Comp. Based Systems (ECBS'07), Tucson, AZ, USA, March 2007. IEEE Computer Society Press.