XML data exchange with target constraints

https://doi.org/10.1016/j.ipm.2012.09.002

Abstract

Data exchange is the problem of taking data structured under a source schema and creating an instance of a target schema, by following a mapping between the two schemas. There is a rich literature on problems related to data exchange, e.g., the design of a schema mapping language, the consistency of schema mappings, operations on mappings, and query answering over mappings. Data exchange has been extensively studied for the relational model, and has recently also been discussed for XML data. This article investigates the construction of target instances for XML data exchange, which has received far less attention. We first present a rich language for the definition of schema mappings, which allows one to use various forms of document navigation and to specify conditions on data values. Given a schema mapping, we then provide an algorithm to construct a canonical target instance. The schema mapping alone is not adequate for expressing target semantics, and hence the canonical instance is in general not optimal. We recognize that target constraints play a crucial role in the generation of good solutions. In light of this, we employ a general XML constraint model to define target constraints. Structural constraints and keys are used to identify entities and serve as rules for data merging. Moreover, we develop techniques to enforce non-key constraints on the canonical target instance, by providing a chase method to reason about data. Experimental results show that our algorithms scale well and are effective in producing target instances of good quality.

Highlights

► Based on tree patterns and equality formulae, we present a rich language for the definition of schema mapping.
► For a given XML data exchange setting and a source instance, we propose an algorithm to construct a canonical solution.
► We employ target constraints to refine the canonical instance.
► We provide a comprehensive evaluation of the algorithms to verify their effectiveness and scalability.

Introduction

An important issue in modern information systems and e-commerce applications is to provide support for inter-operability of independent data sources. In data exchange, data structured under one schema (source schema) are restructured and translated into an instance of a different schema (target schema). The restructuring should follow a specification, known as a schema mapping, which describes the relationship between the source schema and the target schema. Data exchange is used in many tasks that require data to be transferred between existing and independently created applications, often in e-business applications. Two applications can exchange data by directly transferring data following one schema to the other schema. Alternatively, several applications can agree on a standard data schema, and then exchange data between them by converting their own data to/from this form.

The key tasks in data exchange can be roughly divided into two groups.

  1. Intensional tasks, which concern the management of the schema mapping between the source and target schema. Among others, these tasks involve (a) the design of the mapping language, to assess the trade-off between the expressive power and the efficiency of implementation, (b) the static analysis of the mapping, e.g., the consistency of a mapping and the containment of different mappings, and (c) the operations of the mappings, e.g., the inversion and composition.

  2. Extensional (data-level) tasks, for the defined mapping between the source and target schema and the given source instance. These problems mostly involve (a) the construction (materialization) of target instances, and (b) query answering against the target schema, in a way that is consistent with the source data.

Data exchange has received increasing attention from both the research community and the tool market, owing to the extensive need to exchange data that comes with the prevalent use of the Web. After the theoretical foundation of relational data exchange was laid (Fagin et al., 2005, Fagin et al., 2005), various issues in data exchange were studied (Barcelo, 2009, Bernstein and Melnik, 2007). The practical data exchange system Clio (Miller et al., 2001, Popa et al., 2002) has also been built and partly incorporated into commercial systems.

Most existing research has focused on the relational model, and much less is known about XML data exchange. While commercial systems often claim to provide support for XML, this is typically implemented either through relational translations or with simple mappings that cannot change document structure. XML is becoming one of the dominant standards for information representation and interchange on the Web; this motivates the quest for effective methods to handle the ever-growing volume of XML data in data exchange settings.

This article studies the problem of XML data exchange: finding an instance of a target schema for a given instance of a source schema, where both schemas are given by DTDs. Among other things, we identify two key factors in effective XML data exchange: (a) a rich language for XML schema mappings, to support document navigation and restructuring and to bind semantically related values, and (b) a powerful constraint model for target constraints, to improve the semantic expressiveness of XML and to help achieve better target instances.

Example 1

To illustrate the features that need to be modeled in an XML data exchange, consider the source and target DTD schemas given in Fig. 1. Both DTDs describe a set of authors and their publications, and the meanings of the labels are self-explanatory. In the DTD graph, (a) each node denotes an element type; (b) an edge (possibly labeled with “∗”) from a node A to a node B shows the parent–child relationship between the two element types; and (c) the list of attributes of an element type is given in brackets.

Consider the source instance shown in Fig. 1. For ease of reference, we decompose the source instance into a number of fragments, delimited by dashed lines. The relationships between publications and their authors are given in different ways. (a) In fragments 1 and 2, authors and publications are listed separately, and can be joined on attribute @pid in a key/foreign-key style. (b) Fragment 3 denotes the title of a book, whose authors are listed in fragments 4 and 5. The relationship among fragments 3–5 is explicit, since they are in the same subtree rooted at a book element. (c) Fragment 6 (resp. 7) lists the title and author of a paper in a subtree rooted at a paper element.

Fig. 1 then presents a canonical target instance. As remarked earlier, the construction of the target instance should follow the schema mapping from the source schema to the target schema. The details of the schema mapping will be discussed in Section 3, and here we only preview the requirements of this practical mapping. (a) The target instance is a complete restructuring of the source one; it is now organized by authors (element writer), while the source instance is mostly organized by publications (elements book and paper). (b) The target fragment 1 results from the join of source fragments 1 and 2 on attribute @pid. (c) The target instance is selective, in the sense that it only collects part of the source data. Observe that the selection condition is based on data values; source fragment 7 has the same structure as fragment 6, but only fragment 6 is chosen to form a result in the target instance (target fragment 4). (d) Some labeled nulls (denoted by variables) are introduced in the target instance, e.g., variable r1 for the value of attribute @year in target fragment 2. This implies that the mapping from the source schema to the target schema is incomplete, i.e., the source instance cannot provide all the data required by the target instance; these nulls are therefore introduced to represent unknown or missing values. (e) Some constants are provided as node values in the process of data exchange, e.g., “book” for the value of attribute @type in fragment 3. Intuitively, this helps classify the publications in the target; different element types, e.g., book and paper, are used for this purpose in the source. As will be addressed in Section 4, the techniques for introducing nulls and constants in the target instance are actually quite different.

A mapping language that meets all the above-mentioned requirements is certainly a powerful one. However, it still falls short of expressing some target semantics required in real applications. Consequently, one can observe that the target instance is far from satisfactory. (a) Intuitively, variable r1 (resp. s2) should be equal to r2 (resp. s3), since each book has a unique publication year (resp. source). However, this generally cannot be achieved by leveraging Skolem functions (Fagin, Kolaitis, Popa, & Tan, 2005), which take as input the mapping rule and the related source values and produce a labeled null value. Since the source values related to r1 and r2 are (“tom”, “b1”, “MIT”) and (“smith”, “b1”, “Oxford”), respectively, two distinct nulls are generated by existing techniques. (b) The writer named “tom” has multiple publications in the instance. One might be tempted to group them together, since this is allowed by the target schema (one writer node can have multiple work nodes as its children). (c) After all the publications of writer “tom” are collected, one might consider merging the two papers from fragments 1 and 4, since they have the same title. If the two papers are merged, then this will identify variable r3 (resp. s1) with constant “1998” (resp. “c1”). The reason is that each work has exactly one year (resp. source), according to the structural constraints imposed by the target DTD.
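Point (a) can be made concrete with a minimal sketch of a Skolem function. It assumes, for illustration only, that a labeled null is identified by the rule name, the target attribute, and the tuple of bound source values; the class `LabeledNull`, the function `skolem`, and the rule name `"m_book"` are hypothetical and not taken from the article.

```python
from typing import Tuple


class LabeledNull:
    """A placeholder for an unknown value, identified by its Skolem term."""

    def __init__(self, term: Tuple) -> None:
        self.term = term

    def __eq__(self, other: object) -> bool:
        return isinstance(other, LabeledNull) and self.term == other.term

    def __hash__(self) -> int:
        return hash(self.term)

    def __repr__(self) -> str:
        return f"Null{self.term}"


def skolem(rule_id: str, attribute: str, source_values: Tuple[str, ...]) -> LabeledNull:
    """Return the null determined by the rule, the attribute, and the source values."""
    return LabeledNull((rule_id, attribute, source_values))


# Source values are those quoted in the example; "m_book" is a hypothetical rule name.
r1 = skolem("m_book", "year", ("tom", "b1", "MIT"))
r2 = skolem("m_book", "year", ("smith", "b1", "Oxford"))
r1_again = skolem("m_book", "year", ("tom", "b1", "MIT"))

print(r1 == r1_again)  # same inputs give the same null
print(r1 == r2)        # different source values give distinct nulls
```

Under this assumption the function is deterministic, so the same binding always yields the same null, but the two bindings for book “b1” differ in author and affiliation and therefore yield two distinct nulls for the year.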

These observations call for incorporating more target semantics. (a) If we know that the publication year (resp. source) of a book is determined by its title, then we can conclude that r1 and r2 (resp. s2 and s3) are equal. (b) To group together the publications of the writer named “tom”, the writer’s name must be defined as a key; otherwise we cannot be sure that the multiple occurrences of “tom” refer to the same person. (c) For each writer, if the combination of type and title of a work is unique, then we can also merge works by leveraging this.

All the above semantics can be expressed as constraints (dependencies); this motivates us to employ target constraints in the process of data exchange. Target constraints play a crucial role in the generation of good solutions, since they are an essential part of data semantics and should be satisfied by the target instance. With all the mentioned constraints available, we get a much better target instance in Fig. 1.
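As a rough illustration of how a constraint such as "title determines year" can resolve labeled nulls, the following sketch applies one chase-style pass over flat records. It is a simplification under stated assumptions, not the article's algorithm: works are modeled as dictionaries rather than XML nodes, the left-hand side of the dependency is assumed to hold constants only, and the names `Null`, `chase_fd`, and the sample titles are hypothetical.

```python
from typing import Dict, List, Union


class Null:
    """A labeled null; two nulls are equal only if they are the same object."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __repr__(self) -> str:
        return self.name


Value = Union[str, Null]


def chase_fd(rows: List[Dict[str, Value]], lhs: str, rhs: str) -> List[Dict[str, Value]]:
    """Enforce the dependency lhs -> rhs, equating nulls or replacing them by constants."""
    parent: Dict[Null, Value] = {}  # maps a null to the value it was identified with

    def find(value: Value) -> Value:
        while isinstance(value, Null) and value in parent:
            value = parent[value]
        return value

    seen: Dict[Value, Value] = {}  # first rhs value met for each lhs value
    for row in rows:
        key, current = row[lhs], find(row[rhs])
        if key not in seen:
            seen[key] = current
            continue
        other = find(seen[key])
        if other is current or other == current:
            continue
        if isinstance(current, Null):
            parent[current] = other      # null := the other null or constant
        elif isinstance(other, Null):
            parent[other] = current      # earlier null := constant
        else:
            raise ValueError(f"chase fails: {other!r} and {current!r} are distinct constants")
    return [{**row, rhs: find(row[rhs])} for row in rows]


# Hypothetical records: titles are made up, the nulls mirror r1, r2, r3 in the example.
r1, r2, r3 = Null("r1"), Null("r2"), Null("r3")
works = [
    {"title": "book-b1", "year": r1},
    {"title": "book-b1", "year": r2},
    {"title": "paper-p1", "year": "1998"},
    {"title": "paper-p1", "year": r3},
]
result = chase_fd(works, "title", "year")
```

After the pass, both book records share the null r1, and the null r3 has been replaced by the constant “1998”. Two distinct constants for the same title would make the chase fail, which the sketch reports as an error.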

This work investigates target instance construction for XML data exchange. Given (a) a source instance S, (b) an XML data exchange setting (DS, DT, ΣST), where DS, DT and ΣST are the source schema, the target schema, and the mapping between source and target schemas, respectively, and (c) a set Σ of target constraints, our goal is to compute a solution of S w.r.t. the data exchange, by capitalizing on the data exchange setting and target constraints.

  1. Based on tree patterns and equality formulae, we present a rich language for the definition of schema mappings, i.e., ΣST, which allows one to restructure documents by leveraging various forms of navigation and comparisons of data values.

  2. For a given XML data exchange setting and a source instance, we propose an algorithm to construct a canonical solution. Among other things, a Skolem function strategy and the introduction of default element nodes are incorporated in the generation of the canonical instance.

  3. We employ target constraints to refine the canonical instance. (a) We present a general XML constraint model to define target constraints. Besides the commonly discussed functional dependencies and keys, the constraint model also extends the recently introduced conditional functional dependencies to XML. (b) We leverage key constraints, together with the structural constraints imposed by the DTD, to identify entities. The information about the same entity is then merged; this helps improve data compactness. (c) We provide a chase method to reason about missing data based on non-key constraints, by identifying a null with a constant or by enforcing the equality of two nulls; this helps improve data preciseness.

  4. The techniques developed in this article are implemented. We provide a comprehensive evaluation of the algorithms to verify their effectiveness and scalability, using both real-life and synthetic data. The experimental results demonstrate that our algorithms scale well, and help generate good solutions. We contend that our techniques yield a promising method for the construction of XML data exchange solutions.
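The key-based merging mentioned in the third contribution can be sketched as follows. This is only an illustration under simplifying assumptions: writer nodes are modeled as dictionaries, the writer's name is assumed to be the key, and the function `merge_by_key` and the sample titles are hypothetical rather than taken from the article.

```python
from typing import Dict, List


def merge_by_key(writers: List[dict], key: str = "name") -> List[dict]:
    """Merge writer records that agree on the key, concatenating their works."""
    merged: Dict[str, dict] = {}
    for writer in writers:
        # Reuse the record already created for this key value, if any.
        target = merged.setdefault(writer[key], {key: writer[key], "works": []})
        target["works"].extend(writer["works"])
    return list(merged.values())


# Hypothetical canonical instance: "tom" appears in two separate writer nodes.
canonical = [
    {"name": "tom", "works": [{"type": "paper", "title": "t1"}]},
    {"name": "smith", "works": [{"type": "book", "title": "t2"}]},
    {"name": "tom", "works": [{"type": "book", "title": "t2"}]},
]
compact = merge_by_key(canonical)
```

Here the two “tom” nodes collapse into one writer with two works, while “smith” is left unchanged. A fuller treatment would go on to merge works under each writer by a relative key such as (type, title) and then reconcile their values with the chase.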

We next discuss related work.

The notion of a data exchange problem (Fagin et al., 2005) and the properties of core solutions (Fagin et al., 2005) have been formally studied. Following these, the problem of data exchange has been actively studied in the past few years. However, most research on this topic has been conducted in relational settings (Barcelo, 2009, Fagin et al., 2005, Fagin et al., 2005), or slight extensions thereof (Hernandez et al., 2007, Yu and Popa, 2004).

Most relational data exchange systems fall short of handling target constraints. Although these constraints are recognized as an important feature of data exchange, they introduce a number of subtleties in the computation of target instances. A recent work (Marnette, Mecca, & Papotti, 2010) identifies some practical cases where it is possible to compute solutions for data exchange with target equality-generating dependencies, and develops a best-effort algorithm for this. Another work (Gottlob, Pichler, & Savenkov, 2011) proposes a mapping rewriting algorithm that considers target equality-generating dependencies; it aims at minimizing the constraints to make them easier to handle. Our work clearly differs from these relational ones, since the XML data model, the mapping language between DTDs, and the XML constraints are much more complicated than their relational counterparts. We have to introduce most of the basic notions and present new techniques.

The first study of XML data exchange (Arenas & Libkin, 2008) is quite limited due to the simple mapping language used; the mappings proposed do not allow even the simplest joins. A more expressive language for XML schema mappings was then proposed (Amano, Libkin, & Murlak, 2009), and adopted to study the complexity of query answering problems (Amano, David, Libkin, & Murlak, 2010). XML data exchange has also been addressed recently with a different setting of mappings (Bojanczyk, Kolodziejczyk, & Murlak, 2011). These prior works mainly concentrate on the complexity of static analysis problems, e.g., deciding whether all instances of the source schema can be mapped to an instance of the target schema (absolute consistency), and deciding whether a given instance of the source schema can be mapped (solution existence). What these papers neglect entirely is how to construct target instances for XML data exchange. This is important in data exchange tasks, but so far no good algorithms exist, even for the simple mappings (Arenas & Libkin, 2008). We argue that besides decision problems related to XML data exchange, it is also important to study efficient methods for the generation of target instances, to bridge the gap between theory and practice. In this article, we identify some practical settings for efficient (polynomial-time) target instance construction, on the basis of tractable cases (Amano et al., 2009, Arenas and Libkin, 2008).

We also find that the mapping languages proposed so far fail to meet the needs of real applications. Consider Example 1 again: filtering conditions must be specified so that only source fragment 6 qualifies in the generation of the target instance, similar to a where condition in a SQL query. However, none of the current mapping languages (Arenas and Libkin, 2008, Amano et al., 2009, Bojanczyk et al., 2011) can meet this basic requirement. In this article, we employ an expressive mapping language to define schema mappings, and develop techniques to construct a target instance. Moreover, based on a powerful constraint model, we capitalize on target constraints in the process of data exchange; this significantly improves the quality of data exchange solutions, as demonstrated by our experimental studies. To the best of our knowledge, none of the prior works has considered target constraints in the XML data exchange setting.

The XML Stylesheet Language for Transformations (XSLT) can be employed to implement some basic data exchange tasks. However, it is clear that XSLT cannot handle the join operation commonly required in a practical mapping, not to mention the involvement of target constraints.

An alternative approach to handling XML data exchange is to employ relational data exchange together with transformation techniques between XML and relations. Specifically, this approach can be described as follows: XML data is first transformed into relations, to which relational data exchange techniques are applied, and the result of the relational data exchange is then transformed back into XML data. A prerequisite for this approach is that an XML data exchange setting can be properly and completely expressed in a corresponding relational one. This is, however, highly non-trivial and not always possible, due to the mismatch between XML and relations in, e.g., data model, query language and constraints. Therefore, such a method can generally only be applied to some restricted settings.

In data exchange, data is extracted from a source instance, and results in a target instance with a different schema. The data extraction and restructuring tasks in data exchange are generally conducted routinely. Therefore, the definition and implementation of a data exchange rely on the already known source and target schemas, and the pre-defined mapping between the source and target schema. Data exchange is widely used in e-business applications, where the source and target schemas are well understood, and is well suited for the case when certain standard data formats exist in some domains. Moreover, some systems (Miller et al., 2001, Popa et al., 2002) have been developed to (semi-)automatically generate the mapping between the schemas, alleviating the need for fully understanding the transformation language.

Note that data exchange is not intended to provide a general method to convert arbitrary data on the Web, where the contents and structures of data sources are unknown or change frequently. Indeed, even the widely used languages for querying XML data, e.g., XPath and XQuery, are not well suited to cope with XML data without a schema, since users must have some knowledge about the structure of XML documents to write effective path expressions. XML keyword search (Xu & Papakonstantinou, 2005) was introduced to query arbitrary XML data without needing prior knowledge of the structure of the underlying data. However, keyword search can only be employed to extract semantically related data, and falls short of restructuring data.

A different approach to data restructuring, referred to as data harmonization, has been introduced (Niemi, Näppilä, & Järvelin, 2009). In contrast to data exchange, data restructuring in data harmonization is not based on a mapping from the source schema to the target schema. The goal of data harmonization is to satisfy ad hoc information needs, and hence the schemas are assumed to be unknown. Data harmonization first converts XML data into a relational representation, called an XML relation. An important feature of an XML relation is that there is an unambiguous mapping between a textual XML document and its XML relation representation. A query language (Näppilä, Moilanen, & Niemi, 2011) based on XML relations has also been proposed, which can extract, select, rename and restructure XML data.

Note that data exchange differs from data harmonization in its basic assumptions. XML data exchange can generally be implemented more efficiently by capitalizing on the already known schemas, and does not require XML data to be converted into a relational representation. Moreover, although XML relations are compatible with ordinary relations, they are inherently different. A special indexing scheme is embedded in XML relations for the manipulation of structural aspects of XML data, and an automatic re-indexing mechanism is also required when data are restructured. Therefore, existing relational data exchange techniques cannot handle this kind of relation directly.

Section 2 presents the preliminary definitions. The XML data exchange setting is studied in Section 3. Section 4 provides an algorithm to construct the canonical target instance for data exchange. In Section 5, we give an XML constraint model and employ target constraints to refine the canonical instance. The experimental results are given in Section 6, followed by conclusions and future work in Section 7.

Section snippets

Preliminaries

We review some related preliminary definitions in this section.

XML data exchange setting

Recall that a relational data exchange setting (Fagin et al., 2005) is a triple $(S, T, \Sigma_{ST})$, where $S$ and $T$ are source and target relational schemas, and $\Sigma_{ST}$ is a set of source-to-target dependencies. Using $\bar{x}$ to denote a list of variables, each dependency in $\Sigma_{ST}$ is given in the form $\varphi_s(\bar{x},\bar{y}) \rightarrow \psi_t(\bar{x},\bar{z})$, where $\varphi_s$ (resp. $\psi_t$) is a conjunction of atoms over $S$ (resp. $T$), $\bar{x},\bar{y}$ (resp. $\bar{x},\bar{z}$) are free variables in $\varphi_s$ (resp. $\psi_t$), and $\bar{x}$ are common variables in the source and target. Given a source

Canonical target instance construction

Given an XML data exchange setting (DS, DT, ΣST) and a source instance S conforming to DS, as remarked earlier, the solution of S w.r.t. the data exchange is usually not unique. In this section, we present an algorithm, denoted by Canonical, to construct a canonical target instance T as the “standard” solution of S. A canonical target instance exists if and only if there exists at least one solution of S. Intuitively, a canonical target instance carries only the necessary data required by

Incorporating target constraints

In this section, we capitalize on target constraints to refine the canonical instance. More specifically, target constraints are employed to (a) identify and merge partial data of the same entity; and (b) reason about missing values (labeled nulls). We first present an XML constraint model for the definition of target constraints, then develop techniques for (a) and (b).

Experimental study

We present an experimental study using both real-life and synthetic data. Two sets of experiments are conducted to verify (a) the effectiveness of target constraints and (b) the efficiency and scalability of our algorithms for the computation of data exchange solutions.

Conclusions and future work

In this article, we consider the problem of target instance construction for XML data exchange. We have presented an algorithm to construct a canonical target instance, where source-to-target dependencies are defined in a rich mapping language. In addition, based on target constraints, we have developed techniques to merge partial data of the same entity, and to reason about missing data using chase. Our experimental results have verified the scalability and effectiveness of our methods.

This is

Acknowledgments

The authors thank anonymous reviewers for their useful suggestions. This work is supported by the National Natural Science Foundation of China.

References (30)

  • P. Buneman et al. Reasoning about keys for XML. Information Systems (2003)
  • R. Fagin et al. Data exchange: Semantics and query answering. Theoretical Computer Science (2005)
  • Amano, S., Libkin, L., & Murlak, F. (2009). XML schema mappings. In Proceedings of the 28th ACM symposium on principles...
  • Amano, S., David, C., Libkin, L., & Murlak, F. (2010). On the tradeoff between mapping and querying power in XML data...
  • S. Amer-Yahia et al. Tree pattern query minimization. The VLDB Journal (2002)
  • M. Arenas et al. A normal form for XML documents. ACM Transactions on Database Systems (2004)
  • M. Arenas et al. XML data exchange: Consistency and query answering. Journal of the ACM (2008)
  • P. Barcelo. Logical foundations of relational data exchange. SIGMOD Record (2009)
  • Bernstein, P., & Melnik, S. (2007). Model management 2.0: manipulating richer mappings. In Proceedings of the ACM...
  • Bohannon, P., Fan, W. F., Geerts, F., Jia, X., & Kementsietsidis, A. (2007). Conditional functional dependencies for...
  • Bojanczyk, M., Kolodziejczyk, L., & Murlak, F. (2011). Solutions in XML data exchange. In Proceedings of the 14th...
  • Buneman, P., Davidson, S., Fan, W. F., Hara, C., & Tan, W. C. (2001). Keys for XML. In Proceedings of the 10th...
  • R. Fagin et al. Data exchange: Getting to the core. ACM Transactions on Database Systems (2005)
  • R. Fagin et al. Composing schema mappings: Second-order dependencies to the rescue. ACM Transactions on Database Systems (2005)
  • Fan, W. F. (2008). Dependencies revisited for improving data quality. In Proceedings of the 27th ACM symposium on...