CPI: Constraints-Preserving Inlining algorithm for mapping XML DTD to relational schema

https://doi.org/10.1016/S0169-023X(01)00028-3

Abstract

As Extensible Markup Language (XML) is emerging as the data format of the Internet era, there are increasing needs to efficiently store and query XML data. One path to this goal is transforming XML data into relational format in order to use relational database technology. Although several transformation algorithms exist, they are incomplete in the sense that they focus only on structural aspects and ignore semantic aspects. In this paper, we present the semantic knowledge that needs to be captured during transformation to ensure a correct relational schema. Further, we show an algorithm that can (1) derive such semantic knowledge from a given XML Document Type Definition (DTD) and (2) preserve the knowledge by representing it as semantic constraints in relational database terms. By combining existing transformation algorithms and our constraints-preserving algorithm, one can transform XML DTD to relational schema where correct semantics and behaviors are guaranteed by the preserved constraints. Experimental results are also presented.

Introduction

As the World-Wide Web becomes a major means of disseminating and sharing information, Extensible Markup Language (XML) [8] is emerging as a possible candidate data format because it is simpler than SGML, and more powerful than HTML. One way to query XML data is to reuse the established relational database techniques by converting and storing XML data in relational storage. Since the hierarchical XML and the flat relational data models are not fully compatible, the transformation is not a straightforward task.

To this end, several XML-to-relational transformation algorithms have been proposed [13], [17], [35]. For instance, Shanmugasundaram et al. [35] present three algorithms that focus on the table level of the schema while Florescu and Kossmann [17] study different performance issues among eight algorithms that focus on the attribute and value level of the schema. They all transform the given XML Document Type Definition (DTD) to relational schema. Similarly, Deutsch et al. [13] present a data mining-based algorithm that instead uses XML documents directly without a DTD.

Although they work well for the given applications, they miss one important point. That is, the transformation algorithms only capture the structure of a DTD and ignore the hidden semantic constraints. Consider the following example.

Example 1

Consider a DTD modeling conference publications:

  • <!ELEMENT conf (title,society,year,mon?,paper+)>

  • <!ELEMENT paper (pid,title,abstract?)>


Suppose the combination of title and year uniquely identifies the conf. Using the hybrid inlining algorithm (explained in Section 3), the DTD would be transformed to the following relational schema:
  • conf (title,society,year,mon)

  • paper (pid,title,conf_title,conf_year,abstract)


While the relational schema correctly captures the structural aspect of the DTD, it does not enforce correct semantics. For instance, it cannot prevent a tuple t1: paper(100,'DTD...','ER',3000,'...') from being inserted. However, tuple t1 is inconsistent with the semantics of the given DTD, since the DTD implies that a paper cannot exist without being associated with a conference and there is apparently no conference “ER-3000” yet. In database terms, this kind of violation can be easily prevented by an inclusion dependency saying “paper[conf_title,conf_year] ⊆ conf[title,year]”.
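The effect of such an inclusion dependency can be tried out with a small sketch. The following is not the paper's implementation; it assumes SQLite (through Python's standard sqlite3 module) as the relational store and expresses the dependency as a composite foreign key. The column types, the society value 'ER Institute', and the helper names build_schema and try_insert_paper are hypothetical and chosen only for illustration.

```python
import sqlite3

def build_schema(conn):
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    # (title, year) is the key of conf, as assumed in Example 1.
    conn.execute("""
        CREATE TABLE conf (
            title   TEXT NOT NULL,
            society TEXT NOT NULL,
            year    INTEGER NOT NULL,
            mon     TEXT,
            PRIMARY KEY (title, year)
        )""")
    # The foreign key plays the role of the inclusion dependency
    # paper[conf_title, conf_year] ⊆ conf[title, year].
    conn.execute("""
        CREATE TABLE paper (
            pid        INTEGER NOT NULL,
            title      TEXT NOT NULL,
            conf_title TEXT NOT NULL,
            conf_year  INTEGER NOT NULL,
            abstract   TEXT,
            FOREIGN KEY (conf_title, conf_year) REFERENCES conf (title, year)
        )""")

def try_insert_paper(conn, row):
    """Return True if the tuple is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO paper VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
build_schema(conn)
# Hypothetical conference instance; only 'ER' in year 2000 exists.
conn.execute("INSERT INTO conf VALUES ('ER', 'ER Institute', 2000, NULL)")

t1 = (100, "DTD...", "ER", 3000, "...")   # refers to a non-existent conference
t2 = (101, "DTD...", "ER", 2000, None)    # refers to an existing conference

t1_accepted = try_insert_paper(conn, t1)
t2_accepted = try_insert_paper(conn, t2)
print(t1_accepted, t2_accepted)
```

In this sketch the composite primary key on conf is what allows the foreign key to be declared; without a key or uniqueness constraint on (title, year), SQLite would not accept it as a referenced column list.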

The reason for this inconsistency between the DTD and the transformed relational schema is that transformation algorithms only capture the structure of the DTD and ignore the hidden semantic constraints. Via our Constraints-Preserving Inlining (CPI) algorithm, we show the kinds of semantic constraints that can be derived from DTDs during transformation, and illustrate how to preserve them by rewriting them in an output schema notation. Since our algorithm to capture and preserve semantic constraints from DTDs is independent of the transformation algorithms, our algorithm can be applied to various transformation processes such as those in [13], [17], [35] with little change. Fig. 1 presents an overview of our approach. First, given a DTD, we transform it to a corresponding relational scheme using an existing algorithm. Second, during the transformation, we discover various semantic constraints in XML notation. Third, we rewrite the discovered constraints to conform to relational notation.

This paper is organized as follows. Section 2 gives background information and related work. In Section 3, the transformation algorithm is discussed in detail. Section 4 presents various semantic constraints that are hidden in DTDs. Section 5 proposes our algorithm to preserve such constraints during transformation. Section 6 reports some experimental results and Section 7 illustrates two example applications where the discovered semantic constraints are further utilized. Finally, Sections 8 and 9 discuss our vision of future work and concluding remarks.

Background and related work

Relational schema: We define a relational schema R to be composed of a relational scheme (S) and semantic constraints (Δ). That is, R=(S,Δ). In turn, the relational scheme S is a collection of table schemes such as r(a1,…,ak), where ai is the ith attribute in the table r, and the semantic constraints Δ are a collection of semantic knowledge such as domain constraints, inclusion dependencies, equality-generating dependencies, tuple-generating dependencies, etc.
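To make the pair R=(S,Δ) concrete, the following minimal sketch models it with plain Python data classes. The class names TableScheme, InclusionDependency and RelationalSchema are hypothetical and not taken from the paper, and only one kind of constraint (an inclusion dependency) is modeled; the instance reuses the conf/paper schema of Example 1.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class TableScheme:
    """One table scheme r(a1, ..., ak)."""
    name: str
    attributes: tuple

@dataclass(frozen=True)
class InclusionDependency:
    """One member of Δ: lhs_table[lhs_attrs] ⊆ rhs_table[rhs_attrs]."""
    lhs_table: str
    lhs_attrs: tuple
    rhs_table: str
    rhs_attrs: tuple

    def __str__(self):
        return "{}[{}] ⊆ {}[{}]".format(
            self.lhs_table, ",".join(self.lhs_attrs),
            self.rhs_table, ",".join(self.rhs_attrs))

@dataclass
class RelationalSchema:
    """R = (S, Δ): table schemes plus semantic constraints."""
    scheme: list = field(default_factory=list)        # S
    constraints: list = field(default_factory=list)   # Δ

schema = RelationalSchema(
    scheme=[
        TableScheme("conf", ("title", "society", "year", "mon")),
        TableScheme("paper", ("pid", "title", "conf_title", "conf_year", "abstract")),
    ],
    constraints=[
        InclusionDependency("paper", ("conf_title", "conf_year"),
                            "conf", ("title", "year")),
    ],
)
print(schema.constraints[0])
```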

XML and DTD: XML is a textual

Transforming DTD to relational schema

Transforming a hierarchical XML model to a flat relational model is not a trivial task. There are several difficulties including non 1-to-1 mapping, set values, recursion, and fragmentation issues [35]. For a better presentation, we chose one particular transformation algorithm, called the hybrid inlining algorithm [35] among many algorithms [7], [13], [17], [35]. It is chosen since it exhibits the pros of the other two competing algorithms in [35] without severe side effects and it is a more

Domain Constraints

When the domain of an attribute is restricted to a certain specified set of values, the restriction is called a domain constraint. For instance, in the following DTD, the domains of the attributes gender and married are restricted.

  • <!ATTLIST author gender (male|female) #REQUIRED

  •     married (yes|no) #IMPLIED>


In transforming such a DTD into a relational schema, we can enforce the domain constraints using the SQL CHECK clause as follows:

  • CREATE DOMAIN gender VARCHAR(10) CHECK (VALUE IN ('male', 'female'))

  • CREATE DOMAIN
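The step from an enumerated attribute declaration to a CHECK-based domain can be sketched mechanically. The code below is only an illustration under stated assumptions, not the CPI algorithm itself: it uses a simple regular expression that handles enumerated attributes with #REQUIRED or #IMPLIED defaults and nothing else, and the function names enumerated_attributes and create_domain are hypothetical. The VARCHAR(10) length follows the gender example above and is passed as a parameter.

```python
import re

ATTLIST = """<!ATTLIST author gender (male|female) #REQUIRED
                     married (yes|no) #IMPLIED>"""

# attribute name, enumerated values in parentheses, default declaration
ENUM_ATTR = re.compile(r"(\w+)\s*\(([^)]*)\)\s*(#REQUIRED|#IMPLIED)")

def enumerated_attributes(attlist):
    """Return (name, allowed_values, required) for each enumerated attribute."""
    # Drop the leading '<!ATTLIST element-name' and the closing '>'.
    body = re.sub(r"^<!ATTLIST\s+\w+", "", attlist.strip()).rstrip(">")
    result = []
    for name, values, default in ENUM_ATTR.findall(body):
        allowed = [v.strip() for v in values.split("|")]
        result.append((name, allowed, default == "#REQUIRED"))
    return result

def create_domain(name, values, length=10):
    """Render a CREATE DOMAIN statement restricting the domain to the given values."""
    # SQL string literals use single quotes; embedded quotes are doubled.
    quoted = ", ".join("'{}'".format(v.replace("'", "''")) for v in values)
    return "CREATE DOMAIN {} VARCHAR({}) CHECK (VALUE IN ({}))".format(
        name, length, quoted)

attrs = enumerated_attributes(ATTLIST)
gender_name, gender_values, gender_required = attrs[0]
gender_ddl = create_domain(gender_name, gender_values)
print(gender_ddl)
```

The required flag is returned as well because a #REQUIRED attribute additionally suggests a NOT NULL constraint on the corresponding column, while an #IMPLIED attribute may be left null.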

Discovering and preserving semantic constraints

To help find semantic constraints, we use the following data structure:

Definition 1

An annotated DTD graph (ADG) G is a pair (V, E), where V is a finite set and E is a binary relation on V. The set V consists of the elements and attributes in a DTD. Each edge e∈E is labeled with the cardinality relationship types as defined in Section 4.2. In addition, each vertex v∈V carries the following information:

  • 1. indegree stores the number of incoming edges.

  • 2. type contains the element type name in the content model of the DTD
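A rough sketch of such a graph, built for the DTD of Example 1, might look as follows. Several assumptions are made here: only flat, comma-separated content models are parsed, attributes are not added as vertices, and each edge is labeled with the raw DTD occurrence indicator ('', '?', '*' or '+') as a stand-in for the cardinality relationship types of Section 4.2, which are not reproduced in this excerpt. Only the indegree annotation is modeled, and the function name build_graph is hypothetical.

```python
import re
from collections import defaultdict

DTD = """
<!ELEMENT conf (title,society,year,mon?,paper+)>
<!ELEMENT paper (pid,title,abstract?)>
"""

# element name followed by a flat, parenthesized content model
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(\w+)\s+\(([^)]*)\)>")

def build_graph(dtd):
    """Return (edges, indegree) for the element declarations found in dtd."""
    edges = {}                    # (parent, child) -> occurrence indicator
    indegree = defaultdict(int)   # vertex -> number of incoming edges
    for parent, model in ELEMENT_DECL.findall(dtd):
        indegree[parent] += 0     # register the vertex even if nothing points to it
        for item in model.split(","):
            item = item.strip()
            if not item:
                continue
            indicator = item[-1] if item[-1] in "?*+" else ""
            child = item.rstrip("?*+")
            if (parent, child) not in edges:
                indegree[child] += 1
            edges[(parent, child)] = indicator
    return edges, dict(indegree)

edges, indegree = build_graph(DTD)
for (parent, child), indicator in sorted(edges.items()):
    print(parent, "->", child, repr(indicator))
print(indegree)
```

In this example the vertex title is reached from both conf and paper, so its indegree is 2, whereas conf has indegree 0 because no other element refers to it.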

Experimental results

We have implemented the CPI algorithm in Java using the IBM XML4J package. Table 7 shows a summary of our experimentation. We gathered test DTDs from “http://www.oasis-open.org/cover/xml.html” and [32]. Since some DTDs had syntactic errors caught by the XML4J, we had to modify them manually. Note that people seldom used the ID and IDREF(S) constructs in their DTDs except the XMI and BSML cases. The number of tables generated in the relational schema was usually smaller than that of

Application of the semantic constraints

The constraints that are discovered during the transformation are useful to ensure correct semantics of the resulting relational schema. Additionally, they can be used as semantic knowledge in a variety of areas [1], [6], [23], [39]. Since the focus of this paper is not on the application of the constraints, in this section, we will only illustrate a few motivating examples for the possible applications.

Future work

Because of the many benefits of using relational databases as storage systems for XML data, the need for efficient and effective conversion between the relational and XML models will grow significantly in the foreseeable future. We believe that the following directions of research are very important.

First, as we move to more expressive next generation XML schema languages such as XML-Schema [14] or RELAX [29], the degree of complexity captured in an XML schema is far greater than that in a DTD. For

Conclusion

This paper presents a method to transform XML DTD to relational schema both in structural and semantic aspects. After discussing the semantic constraints hidden in DTDs, two algorithms are presented for: (1) discovering the semantic constraints using the hybrid inlining algorithm, and (2) rewriting the semantic constraints in relational notation. Our experimental results reveal that constraints can be systematically preserved during the conversion from XML to relational schema. Such constraints

Dongwon Lee received his B.S. from Korea University, Seoul, Korea in 1993 and M.S. from Columbia University, New York, USA in 1995, both in Computer Science. Afterwards, he worked at AT&T Bell Labs (now AT&T Labs – Research) from 1995 to 1997. He is currently working towards his Ph.D. in Computer Science at UCLA. His research interests include intelligent information systems, semi-structured and XML databases, and the World-Wide Web.

References (39)

  • S. Abiteboul et al.

    Data on the Web: From Relations to Semistructured Data and XML

    (1999)
  • S. Banerjee et al.

    Oracle8i – The XML enabled data management system

    IEEE ICDE, San Diego, CA

    (February 2000)
  • C. Batini et al.

    Conceptual Database Design: An Entity-Relationship Approach

    (1992)
  • P. Bernstein et al.

    A vision of management of complex models

    ACM SIGMOD Record

    (2000)
  • P.V. Biron, A. Malhotra (Eds.), XML Schema Part 2: Datatypes, W3C Recommendation, http://www.w3.org/TR/xmlschema-2/,...
  • K. Böhm et al.

    Query optimization for structured documents based on knowledge on the document type definition

    IEEE Advances in Digital Libraries (ADL). Los Altos, CA

    (April 1998)
  • R. Bourret, XML and Databases, Web page, http://www.rpbourret.com/xml/XMLAndDatabases.htm, September...
  • T. Bray, J. Paoli, C.M. Sperberg-McQueen (Eds.), Extensible Markup Language (XML) 1.0 (2nd Edition), W3C...
  • P. Buneman et al.

    Path constraints in semistructured and structured databases

    ACM PODS, Seattle, WA

    (1998)
  • M. Carey, D. Florescu, Z. Ives, Y. Lu, J. Shanmugasundaram, E. Shekita, S. Subramanian, XPERANTO: Publishing...
  • J.M. Cheng et al.

    XML and DB2

    IEEE ICDE, San Diego, CA

    (February 2000)
  • V. Christophides et al.

    From structured document to novel query facilities

    ACM SIGMOD, Minneapolis, MN

    (June 1994)
  • A. Deutsch et al.

    Storing semistructured data with STORED

    ACM SIGMOD, Philadelphia, PA

    (June 1998)
  • D.C. Fallside (Ed.), XML Schema Part 0: Primer, W3C Recommendation, http://www.w3.org/TR/xmlschema-0, May...
  • W. Fan et al.

    Integrity constraints for XML

    ACM PODS, Dallas, TX

    (May 2000)
  • M.F. Fernandez, W.-C. Tan, D. Suciu, SilkRoute: Trading between relations and XML, in: International World Wide Web...
  • D. Florescu et al.

    Storing and querying XML data using an RDBMS

    IEEE Data Eng. Bull.

    (1999)
  • M.N. Garofalakis et al.

    XTRACT: A system for extracting document type descriptors from XML documents

    ACM SIGMOD, Dallas, TX

    (May 2000)
  • J.E. Hopcroft et al.

    Introduction to Automata Theory, Languages, and Computation

    (2001)
Wesley W. Chu is a professor of Computer Science and a past chairman (1988–1991) of the Computer Science Department at the University of California, Los Angeles. His current research interests are in the areas of distributed processing, knowledge-based information systems, and intelligent web-based databases. He was the conference chair of the 16th International Conference on Conceptual Modeling (ER'97). He is also currently a member of the Editorial Board of the Journal on Very Large Data Bases and an Associate Editor for the Journal of Data and Knowledge Engineering. Dr. Chu is a Fellow of IEEE.
