CPI: Constraints-Preserving Inlining algorithm for mapping XML DTD to relational schema
Introduction
As the World-Wide Web becomes a major means of disseminating and sharing information, Extensible Markup Language (XML) [8] is emerging as a possible candidate data format because it is simpler than SGML, and more powerful than HTML. One way to query XML data is to reuse the established relational database techniques by converting and storing XML data in relational storage. Since the hierarchical XML and the flat relational data models are not fully compatible, the transformation is not a straightforward task.
To this end, several XML-to-relational transformation algorithms have been proposed [13], [17], [35]. For instance, Shanmugasundaram et al. [35] present three algorithms that focus on the table level of the schema, while Florescu and Kossmann [17] study different performance issues among eight algorithms that focus on the attribute and value level of the schema. They all transform the given XML Document Type Definition (DTD) to relational schema. In contrast, Deutsch et al. [13] present a data mining-based algorithm that uses XML documents directly without a DTD.
Although they work well for the given applications, they miss one important point. That is, the transformation algorithms only capture the structure of a DTD and ignore the hidden semantic constraints. Consider the following example.
Example 1 Consider a DTD modeling conference publications:
<!ELEMENT conf (title,society,year,mon?,paper+)>
<!ELEMENT paper (pid,title,abstract?)>
Suppose the combination of title and year uniquely identifies the conf. Using the hybrid inlining algorithm (explained in Section 3), the DTD would be transformed to the following relational schema:
conf (title,society,year,mon)
paper (pid,title,conf_title,conf_year,abstract)
While the relational schema correctly captures the structural aspect of the DTD, it does not enforce correct semantics. For instance, it cannot prevent a tuple t1: paper(100,'DTD...','ER',3000,'...') from being inserted. However, tuple t1 is inconsistent with the semantics of the given DTD, since the DTD implies that a paper cannot exist without being associated with a conference, and there is apparently no conference “ER-3000” yet. In database terms, this kind of violation can be easily prevented by an inclusion dependency saying “paper[conf_title,conf_year] ⊆ conf[title,year]”.
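The effect of the inclusion dependency can be tried out with a small sketch. The following is an illustration only, not the CPI algorithm itself: it assumes SQLite (through Python's standard sqlite3 module) as the target database, expresses the inclusion dependency as a composite foreign key, and uses hypothetical column types and a made-up sample conference row. The helper names build_schema and try_insert_paper are likewise assumptions introduced for this example.

```python
import sqlite3


def build_schema(conn):
    """Create the conf/paper tables from Example 1 with the inclusion dependency."""
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("""
        CREATE TABLE conf (
            title   TEXT NOT NULL,
            society TEXT NOT NULL,
            year    INTEGER NOT NULL,
            mon     TEXT,
            PRIMARY KEY (title, year)  -- title and year identify a conf
        )""")
    conn.execute("""
        CREATE TABLE paper (
            pid        INTEGER PRIMARY KEY,
            title      TEXT NOT NULL,
            conf_title TEXT NOT NULL,
            conf_year  INTEGER NOT NULL,
            abstract   TEXT,
            -- paper[conf_title, conf_year] must appear in conf[title, year]
            FOREIGN KEY (conf_title, conf_year) REFERENCES conf (title, year)
        )""")


def try_insert_paper(conn, row):
    """Return True if the paper tuple is accepted, False if it is rejected."""
    try:
        conn.execute("INSERT INTO paper VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
build_schema(conn)
# Hypothetical sample data: one existing conference.
conn.execute("INSERT INTO conf VALUES ('ER', 'SomeSociety', 2000, NULL)")

# Tuple t1 from the text refers to the non-existent conference "ER-3000".
t1_accepted = try_insert_paper(conn, (100, 'DTD...', 'ER', 3000, '...'))
# A paper that refers to the existing conference is accepted.
t2_accepted = try_insert_paper(conn, (101, 'Another paper', 'ER', 2000, None))
```

With the foreign key in place, t1 is rejected while the second tuple is accepted; without it, both inserts would succeed, which is the inconsistency described above.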
The reason for this inconsistency between the DTD and the transformed relational schema is that transformation algorithms only capture the structure of the DTD and ignore the hidden semantic constraints. Via our Constraints-Preserving Inlining (CPI) algorithm, we show the kinds of semantic constraints that can be derived from DTDs during transformation, and illustrate how to preserve them by rewriting them in an output schema notation. Since our algorithm to capture and preserve semantic constraints from DTDs is independent of the transformation algorithms, our algorithm can be applied to various transformation processes such as those in [13], [17], [35] with little change. Fig. 1 presents an overview of our approach. First, given a DTD, we transform it to a corresponding relational scheme using an existing algorithm. Second, during the transformation, we discover various semantic constraints in XML notation. Third, we rewrite the discovered constraints to conform to relational notation.
This paper is organized as follows. Section 2 gives background information and related work. In Section 3, the transformation algorithm is discussed in detail. Section 4 presents various semantic constraints that are hidden in DTDs. Section 5 proposes our algorithm to preserve such constraints during transformation. Section 6 reports some experimental results and Section 7 illustrates two example applications where the discovered semantic constraints are further utilized. Finally, Sections 8 and 9 discuss our vision of future work and concluding remarks.
Section snippets
Background and related work
Relational schema: We define a relational schema R to be composed of a relational scheme (S) and semantic constraints (Δ). That is, R = (S, Δ). In turn, the relational scheme S is a collection of table schemes such as r(a1,…,ak), where ai is the ith attribute in the table r, and the semantic constraints Δ are a collection of semantic knowledge such as domain constraints, inclusion dependencies, equality-generating dependencies, tuple-generating dependencies, etc.
XML and DTD: XML is a textual
Transforming DTD to relational schema
Transforming a hierarchical XML model to a flat relational model is not a trivial task. There are several difficulties including non 1-to-1 mapping, set values, recursion, and fragmentation issues [35]. For a better presentation, we chose one particular transformation algorithm, called the hybrid inlining algorithm [35] among many algorithms [7], [13], [17], [35]. It is chosen since it exhibits the pros of the other two competing algorithms in [35] without severe side effects and it is a more
Domain Constraints
When the domain of an attribute is restricted to a certain specified set of values, the restriction is called a domain constraint. For instance, in the following DTD, the domains of the attributes gender and married are restricted.
<!ATTLIST author gender (male|female) #REQUIRED
married (yes|no) #IMPLIED>
In transforming such a DTD into a relational schema, we can enforce the domain constraints using the SQL CHECK clause as follows:
CREATE DOMAIN gender VARCHAR(10) CHECK (VALUE IN ('male', 'female'))
CREATE DOMAIN
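The mapping from enumerated attribute types to CHECK clauses can be sketched mechanically. The code below is a minimal illustration, not the paper's implementation: the function name domain_statements, the regular expressions, and the fixed VARCHAR width are assumptions, and it only handles enumerated attributes of the simple form shown above. The statement it produces for married follows the same pattern as the one for gender and is the sketch's own extrapolation.

```python
import re

# Matches one <!ATTLIST element ...> declaration and captures its body.
ATTLIST_RE = re.compile(r"<!ATTLIST\s+(\w+)\s+(.*?)>", re.DOTALL)
# Matches an enumerated attribute such as: gender (male|female) #REQUIRED
ENUM_ATTR_RE = re.compile(r"(\w+)\s*\(([^)]*)\)\s*(#REQUIRED|#IMPLIED)?")


def domain_statements(dtd_text, width=10):
    """Return one CREATE DOMAIN statement per enumerated attribute in the DTD."""
    statements = []
    for _element, body in ATTLIST_RE.findall(dtd_text):
        for name, values, _default in ENUM_ATTR_RE.findall(body):
            # SQL string literals use single quotes; double any embedded quote.
            literals = ", ".join(
                "'" + value.strip().replace("'", "''") + "'"
                for value in values.split("|")
            )
            statements.append(
                f"CREATE DOMAIN {name} VARCHAR({width}) "
                f"CHECK (VALUE IN ({literals}))"
            )
    return statements


dtd = """<!ATTLIST author gender (male|female) #REQUIRED
                 married (yes|no) #IMPLIED>"""

statements = domain_statements(dtd)
for statement in statements:
    print(statement)
```

Note that CREATE DOMAIN is part of standard SQL but is not supported by every database system; where it is unavailable, the same CHECK condition can be attached to the column definition instead.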
Discovering and preserving semantic constraints
To help find semantic constraints, we use the following data structure:
Definition 1 An annotated DTD graph (ADG) G is a pair (V, E), where V is a finite set and E is a binary relation on V. The set V consists of the elements and attributes in a DTD. Each edge e ∈ E is labeled with the cardinality relationship types as defined in Section 4.2. In addition, each vertex v ∈ V carries the following information:
- indegree stores the number of incoming edges.
- type contains the element type name in the content model of the DTD
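A rough sketch of such a graph, built from the DTD of Example 1, may make the definition more concrete. Everything below is an assumption made for illustration: the class and function names, the mapping from the DTD occurrence operators (none, ?, *, +) to cardinality labels (the actual types are defined in Section 4.2, which is not reproduced here), and the restriction to flat, comma-separated content models without nested groups, choices, or attributes. Only the indegree and type fields named in the definition are modeled.

```python
import re
from dataclasses import dataclass, field

# Assumed labels for the DTD occurrence operators; the paper's own
# cardinality relationship types are defined in its Section 4.2.
CARDINALITY = {"": "(1,1)", "?": "(0,1)", "*": "(0,N)", "+": "(1,N)"}


@dataclass
class Vertex:
    type: str          # element type name in the content model
    indegree: int = 0  # number of incoming edges


@dataclass
class ADG:
    vertices: dict = field(default_factory=dict)  # name -> Vertex
    edges: dict = field(default_factory=dict)     # (parent, child) -> label

    def vertex(self, name):
        """Return the vertex for name, creating it on first use."""
        return self.vertices.setdefault(name, Vertex(type=name))

    def add_edge(self, parent, child, operator=""):
        """Add a labeled edge and update the child's indegree once per edge."""
        self.vertex(parent)
        key = (parent, child)
        if key not in self.edges:
            self.vertex(child).indegree += 1
        self.edges[key] = CARDINALITY[operator]


# Matches declarations with a flat content model, e.g. (pid,title,abstract?).
ELEMENT_RE = re.compile(r"<!ELEMENT\s+(\w+)\s+\(([^)]*)\)>")


def build_adg(dtd_text):
    """Build an ADG from the <!ELEMENT ...> declarations in dtd_text."""
    graph = ADG()
    for parent, content in ELEMENT_RE.findall(dtd_text):
        for item in content.split(","):
            match = re.fullmatch(r"\s*(\w+)([?*+]?)\s*", item)
            if match:  # items such as #PCDATA are skipped in this sketch
                graph.add_edge(parent, match.group(1), match.group(2))
    return graph


dtd = """<!ELEMENT conf (title,society,year,mon?,paper+)>
<!ELEMENT paper (pid,title,abstract?)>"""
adg = build_adg(dtd)
```

In this sketch the element title is shared by conf and paper, so its vertex has indegree 2, while conf, which nothing points to, has indegree 0.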
Experimental results
We have implemented the CPI algorithm in Java using the IBM XML4J package. Table 7 shows a summary of our experimentation. We gathered test DTDs from “http://www.oasis-open.org/cover/xml.html” and [32]. Since some DTDs had syntactic errors caught by XML4J, we had to modify them manually. Note that the ID and IDREF(S) constructs were seldom used in the DTDs, except in the XMI and BSML cases. The number of tables generated in the relational schema was usually smaller than that of
Application of the semantic constraints
The constraints that are discovered during the transformation are useful to ensure correct semantics of the resulting relational schema. Additionally, they can be used as semantic knowledge in a variety of areas [1], [6], [23], [39]. Since the focus of this paper is not on the application of the constraints, in this section, we will only illustrate a few motivating examples for the possible applications.
Future work
Due to the many benefits of using relational databases as storage systems for XML data, the need for efficient and effective conversion between relational and XML models will grow significantly in the foreseeable future. We believe that the following directions of research are very important.
First, as we move to more expressive next-generation XML schema languages such as XML-Schema [14] or RELAX [29], the degree of complexity captured in an XML schema is far greater than that in a DTD. For
Conclusion
This paper presents a method to transform XML DTD to relational schema both in structural and semantic aspects. After discussing the semantic constraints hidden in DTDs, two algorithms are presented for: (1) discovering the semantic constraints using the hybrid inlining algorithm, and (2) rewriting the semantic constraints in relational notation. Our experimental results reveal that constraints can be systematically preserved during the conversion from XML to relational schema. Such constraints
Dongwon Lee received his B.S. from Korea University, Seoul, Korea in 1993 and M.S. from Columbia University, New York, USA in 1995, both in Computer Science. Afterwards, he worked at AT&T Bell Labs (now AT&T Labs – Research) from 1995 to 1997. He is currently working towards his Ph.D. in Computer Science at UCLA. His research interests include intelligent information systems, semi-structured and XML databases, and the World-Wide Web.
References (39)
- et al., Data on the Web: From Relations to Semistructured Data and XML (1999)
- et al., Oracle8i – The XML enabled data management system, IEEE ICDE, San Diego, CA (February 2000)
- et al., Conceptual Database Design: An Entity-Relationship Approach (1992)
- et al., A vision of management of complex models, ACM SIGMOD Record (2000)
- P.V. Biron, A. Malhotra (Eds.), XML Schema Part 2: Datatypes, W3C Recommendation, http://www.w3.org/TR/xmlschema-2/,...
- et al., Query optimization for structured documents based on knowledge on the document type definition, IEEE Advances in Digital Libraries (ADL), Los Altos, CA (April 1998)
- R. Bourret, XML and Databases, Web page, http://www.rpbourret.com/xml/XMLAndDatabases.htm, September...
- T. Bray, J. Paoli, C.M. Sperberg-McQueen (Eds.), Extensible Markup Language (XML) 1.0 (2nd Edition), W3C...
- et al., Path constraints in semistructured and structured databases, ACM PODS, Seattle, WA (1998)
- M. Carey, D. Florescu, Z. Ives, Y. Lu, J. Shanmugasundaram, E. Shekita, S. Subramanian, XPERANTO: Publishing...
- XML and DB2, IEEE ICDE, San Diego, CA
- From structured document to novel query facilities, ACM SIGMOD, Minneapolis, MN
- Storing semistructured data with STORED, ACM SIGMOD, Philadelphia, PA
- Integrity constraints for XML, ACM PODS, Dallas, TX
- Storing and querying XML data using an RDBMS, IEEE Data Eng. Bull.
- XTRACT: A system for extracting document type descriptors from XML documents, ACM SIGMOD, Dallas, TX
- Introduction to Automata Theory, Languages, and Computation
Wesley W. Chu is a professor of Computer Science and a past chairman (1988–1991) of the Computer Science Department at the University of California, Los Angeles. His current research interest is in the areas of distributed processing, knowledge-based information systems, and intelligent web-based databases. He was the conference chair of the 16th International Conference on Conceptual Modeling (ER'97). He is also currently a member of the Editorial Board of the Journal on Very Large Data Bases and an Associate Editor for the Journal of Data and Knowledge Engineering. Dr. Chu is a Fellow of IEEE.