DOI: 10.1145/1284420.1284472
Article

Towards automatic document migration: semantic preservation of embedded queries

Published: 28 August 2007

ABSTRACT

Archivists and librarians face an ever-increasing amount of digital material. Their task is to preserve its authentic content. In the long run, this requires periodic migrations (from one format to another or from one hardware/software platform to another). Document migrations are challenging tasks where tool support and a high degree of automation are important. A central aspect is that documents are often mutually related and, hence, a document's semantics has to be considered in its whole context. References between documents are usually formulated in graph- or tree-based query languages such as URLs or XPath. A typical scenario is web archiving, where websites are stored inside a server infrastructure that can be queried from HTML files using URLs. Migrating websites will often require link adaptation in order to preserve link consistency. Although automated and "trustworthy" preservation of link consistency is easy to postulate, it is hard to carry out, in particular if "trustworthy" means "provably working correctly". In this paper, we propose a general approach to semantically evaluating and constructing graph queries that conform to a regular grammar, appear as part of a document's content, and access a graph structure that is specified using First-Order Predicate Logic (FOPL). To do so, we adapt model checking techniques by constructing suitable query automata. We integrate these techniques into our preservation framework [12] and show the feasibility of this approach using an example: we migrate a website to a specific archiving format and demonstrate the automated preservation of link consistency. The approach shown in this paper mainly contributes to a higher degree of automation in document migration while still maintaining a high degree of "trustworthiness", namely "provable correctness".
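
The link-consistency property described in the abstract can be illustrated with a small sketch. This is not the paper's automaton-based method: it is a toy model that assumes a website is a set of absolute paths, that a migration is a plain dictionary from old paths to new paths, and that links are relative URL paths. The names `resolve`, `adapt_link`, `is_link_consistent`, and all example paths are hypothetical.

```python
import posixpath


def resolve(doc_path, link):
    """Resolve a relative link against the directory of the containing document."""
    return posixpath.normpath(posixpath.join(posixpath.dirname(doc_path), link))


def adapt_link(old_doc, link, migration):
    """Rewrite a link found in old_doc so it names the migrated target
    relative to the migrated document."""
    old_target = resolve(old_doc, link)
    new_doc = migration[old_doc]
    new_target = migration[old_target]
    return posixpath.relpath(new_target, posixpath.dirname(new_doc))


def is_link_consistent(old_doc, old_link, new_link, migration):
    """A link is consistent if migrating its old target yields exactly the
    document that the new link resolves to from the migrated document."""
    expected = migration[resolve(old_doc, old_link)]
    actual = resolve(migration[old_doc], new_link)
    return expected == actual


# Hypothetical migration of a small site into an archive layout (assumption).
migration = {
    "/site/index.html": "/archive/0001/content/index.html",
    "/site/docs/paper.html": "/archive/0001/content/paper.html",
    "/site/img/logo.png": "/archive/0001/media/logo.png",
}

old_doc = "/site/docs/paper.html"
old_link = "../img/logo.png"
new_link = adapt_link(old_doc, old_link, migration)
```

In this toy setting, copying `old_link` unchanged into the migrated document would break the reference, whereas the adapted link passes the `is_link_consistent` check. The approach in the paper aims at proving this kind of property rather than testing it on individual links.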

References

  1. Cascading style sheets, level 2 revision 1, CSS 2.1 specification. Technical report, W3C, 2006. http://www.w3.org/TR/CSS21/.
  2. U. M. Borghoff, P. Rödig, J. Scheffczyk, and L. Schmitz. Long-Term Preservation of Digital Documents: Principles and Practices. Springer Verlag, Berlin, Heidelberg, New York, 2006. ISBN: 978-3-540-33639-6.
  3. B. Chidlovskii. Using regular tree automata as XML schemas. In Proc. Int. IEEE Conf. on Advances in Dig. Lib. (ADL2000), pages 89--98, Washington, DC, USA, May 2000.
  4. Consultative Committee for Space Data Systems CCSDS. Reference model for an open archival information system (OAIS). Technical report, Space Data Systems, 2002.
  5. J. E. Hopcroft, R. Motwani, and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison Wesley, 3rd edition, July 2006. ISBN: 978-0321455369.
  6. J. Le Maitre. Describing multistructured XML documents by means of delay nodes. In Proc. of the ACM Symp. on Doc. Eng. (DocEng 2006), pages 155--164, Amsterdam, The Netherlands, October 2006.
  7. C. Nentwich, L. Capra, W. Emmerich, and A. Finkelstein. xlinkit: a consistency checking and smart link generation service. ACM Trans. Inter. Tech., 2(2):151--185, 2002.
  8. Network Working Group. Uniform resource identifiers (URI): Generic syntax, 1998. http://www.ietf.org/rfc/rfc2396.txt.
  9. F. Neven. Automata, logic, and XML. In Proc. of the 16th Int. Workshop CSL 2002, 11th Ann. Conf. of the EACSL, volume 2471, pages 2--26, Edinburgh, UK, Sep. 2002. Springer LNCS.
  10. J. Scheffczyk, U. M. Borghoff, P. Rödig, and L. Schmitz. Managing inconsistent repositories via prioritized repair actions. In Proceedings of the ACM Symposium on Document Engineering (DocEng 2004), pages 137--146. ACM, October 2004.
  11. T. Triebsees and U. M. Borghoff. Preservation-centric and constraint-based migration of digital documents. In Proc. of the ACM Symp. on Doc. Eng. (DocEng 2006), pages 59--62, Amsterdam, The Netherlands, October 2006.
  12. T. Triebsees and U. M. Borghoff. A theory for model-based transformation applied to computer-supported preservation in digital archives. In Proc. 14th Ann. IEEE Int. Conf. on the Eng. of Comp. Based Systems (ECBS'07), Tucson, AZ, USA, March 2007. IEEE Computer Society Press.

Published in

DocEng '07: Proceedings of the 2007 ACM symposium on Document engineering
August 2007
236 pages
ISBN: 9781595937766
DOI: 10.1145/1284420

Copyright © 2007 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 178 of 537 submissions, 33%