Xandy: A scalable change detection technique for ordered XML documents using relational databases

https://doi.org/10.1016/j.datak.2005.06.006

Abstract

Previous work on change detection for XML documents is not suitable for large documents because it requires enough memory to hold both versions of a document at once. In this article, we take a more conservative yet novel approach of using traditional relational database engines to detect the changes to large ordered XML documents. To this end, we have implemented a prototype system called Xandy that converts XML documents into relational tuples and detects the changes from these tuples by using SQL queries. Our experimental results show that the relational-based approach scales better than published algorithms such as X-Diff, and that in some cases its efficiency and result quality are comparable to those of X-Diff. The results also show that Xandy generally has better result quality than XyDiff.

Introduction

Over the next few years XML is likely to replace HTML as the standard format for publishing and transporting documents over the Web. The Web allows these documents to change at any time and in any way. These changes typically take two general forms. The first is existence: XML pages exhibit varied longevity patterns. The second is structure and content modification: an XML document replaces its antecedents, usually leaving no trace of the previous document. These rapid and often unpredictable changes to the information create a new problem of detecting and representing these changes (hereafter called XML deltas or XDeltas). Such a change detection tool is important to incremental query evaluation, trigger condition evaluation, search engines, data mining applications, and mobile applications [4], [15].

Even though the underlying challenge is how to detect and represent the changes to large volumes of data, the novel context of XML forces us to significantly extend traditional techniques. XML data is commonly modelled by a tree structure (hereafter called XML tree), where nodes represent elements, attributes and text data, and parent–child pairs represent nesting between XML elements. XML trees are classified into ordered trees and unordered trees. An ordered tree is one in which both the ancestor relationships and the left-to-right ordering among siblings are significant. An unordered tree is one in which only ancestor relationships are significant. In this article, we focus on ordered XML documents.
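To make the distinction between ordered and unordered trees concrete, here is a minimal Python sketch. It is an illustration only and not part of Xandy; the `Node` class and the functions `ordered_equal`, `canonical`, and `unordered_equal` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A labeled tree node; `children` keeps the left-to-right sibling order."""
    label: str
    children: List["Node"] = field(default_factory=list)


def ordered_equal(a: Node, b: Node) -> bool:
    """Equal as ordered trees: same labels and same sibling order."""
    return (
        a.label == b.label
        and len(a.children) == len(b.children)
        and all(ordered_equal(x, y) for x, y in zip(a.children, b.children))
    )


def canonical(n: Node) -> tuple:
    """Order-insensitive canonical form: children are sorted recursively."""
    return (n.label, tuple(sorted(canonical(c) for c in n.children)))


def unordered_equal(a: Node, b: Node) -> bool:
    """Equal as unordered trees: only ancestor relationships matter."""
    return canonical(a) == canonical(b)


# Two toy trees that differ only in the order of the two children.
t1 = Node("person", [Node("name"), Node("interest")])
t2 = Node("person", [Node("interest"), Node("name")])
```

Under the ordered interpretation `t1` and `t2` differ (a sibling has moved), whereas under the unordered interpretation they are the same tree. This is why moves among siblings only arise as a change type for ordered documents.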

The changes to ordered XML documents can be classified into two types: changes to internal elements and changes to leaf elements. An internal element does not contain textual data. For example, consider two versions of an XML document in Fig. 1. For the time being, ignore the dotted boxes. The nodes 3 and 9 in Fig. 1(a) are internal elements. The changes to internal elements are called structural changes as they modify the structure of the document but do not change the textual data content. We consider the following types of structural changes: internal element insertion, internal element deletion, and internal element movement. For instance, node 114 in Fig. 1(b) is an example of internal element insertion. Node 15 (node 103 in Fig. 1(b)) is moved from being the fourth child of node 1 to be the second child of node 101. Note that node 101 in T2 is the corresponding node of node 1 in T1. A leaf element is an element/attribute which contains textual data. For example, node 4 is a leaf element which has name “name” and textual content “Smith”. The changes to leaf elements are called content changes as they modify the textual data content. We consider the following four types of content changes: leaf element insertion, leaf element deletion, content update of a leaf element, and leaf element movement. For example, a leaf element “Interest” (id = 108) which has value “Information Retrieval” is an inserted leaf element. In this article, we present novel techniques for detecting the content and structural changes in ordered XML documents using relational databases.

The XML change detection problem is related to the problem of change detection to trees. In [2], the authors address the problem of detecting changes to two snapshots of hierarchically structured information that are represented as ordered trees. MH-Diff [1] is an efficient algorithm for meaningful change detection between two unordered trees. The authors introduce the following matching criteria to compare nodes, and the matchings between two versions of a tree are determined based on this assumption.

Given two labeled trees, T1 and T2, there is a “good” matching function, so that given any leaf s in T1, there is at most one leaf in T2 that is “close” enough to match s.

The faster version of the matching algorithm uses longest common subsequence computations for every element node starting from the leaves of the document. The algorithm runs in time O(ne + e^2), where n is the total number of leaf nodes, and e is a weighted edit distance between the two trees. This assumption holds well for many SGML documents that do not contain duplicate or similar objects, but it does not hold for many XML documents.
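As a rough illustration of the longest common subsequence (LCS) step mentioned above, the sketch below applies the textbook dynamic-programming LCS to the child label sequences of two matched nodes. It is not the algorithm of [2], which compares nodes with its own matching criteria and achieves the stated bound; the function name `lcs` and the sample child lists are assumptions made for this example.

```python
from typing import List, Sequence


def lcs(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Return one longest common subsequence of a and b (textbook O(|a||b|) DP)."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of the suffixes a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the top-left corner to recover the subsequence.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical child labels of a node in the old and new versions.
old_children = ["name", "address", "phone", "interest"]
new_children = ["name", "interest", "address", "phone"]
kept = lcs(old_children, new_children)
```

Children that fall inside the common subsequence keep their relative order; the remaining ones (here `interest`) are the candidates for moves, insertions, or deletions.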

Recently, a number of techniques for detecting the changes to XML data have been proposed. Most of these techniques focus on developing main-memory algorithms to detect the changes. XMLTreeDiff [5] and XyDiff [4] are designed for detecting the changes to ordered XML documents. In XyDiff, the changes are detected by using signatures and weights of nodes. For each node in an XML DOM tree, the signature is computed using the node's content and its children's signatures. Simultaneously, the weight is computed for each node, based on the size of its content for text nodes and the sum of the weights of its children for element nodes. The change detection starts by finding a matching between the heaviest nodes. Note that the heavier subtree will have higher priority to be chosen for comparison. Once a match is found, it is propagated to the ancestor and descendant nodes to get more matchings. Inserts, deletes and moves are computed after all exact matches are found. XMLTreeDiff (a tool developed by IBM) is a set of JavaBeans that performs ordered tree-to-tree comparison to detect the changes to XML documents by using DOMHash [10]. X-Diff [15] is designed for computing the XDeltas for two unordered XML documents. The main strength of the X-Diff algorithm is that it reduces the mapping space significantly, which helps the algorithm achieve polynomial time complexity. However, its change detection response time is slower than that of XyDiff. XMLTreeDiff [5], XyDiff [4], and X-Diff [15] are memory-based approaches, as they parse both versions of the XML documents and detect the changes while both documents are in main memory.
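The bottom-up computation of signatures and weights described for XyDiff can be sketched as follows. This is a simplified illustration rather than XyDiff's actual implementation: the use of SHA-1, the choice of text length as the weight of textual content, and the function name `sign_and_weigh` are all assumptions, and attributes are ignored for brevity.

```python
import hashlib
import xml.etree.ElementTree as ET
from typing import Tuple


def sign_and_weigh(elem: ET.Element) -> Tuple[str, int]:
    """Return (signature, weight) for the subtree rooted at elem.

    Assumed simplifications: SHA-1 as the hash, the text length as the weight
    of textual content, and the sum of child weights for element nodes.
    """
    text = (elem.text or "").strip()
    child_results = [sign_and_weigh(child) for child in elem]
    h = hashlib.sha1()
    h.update(elem.tag.encode("utf-8"))
    h.update(b"\x00")
    h.update(text.encode("utf-8"))
    for child_sig, _ in child_results:  # child order matters for ordered trees
        h.update(b"\x00")
        h.update(child_sig.encode("utf-8"))
    weight = len(text) + sum(w for _, w in child_results)
    return h.hexdigest(), weight


# Two toy versions that differ in one leaf value.
v1 = ET.fromstring("<p><name>Smith</name><interest>XML</interest></p>")
v2 = ET.fromstring("<p><name>Smith</name><interest>Web</interest></p>")
sig1, w1 = sign_and_weigh(v1)
sig2, w2 = sign_and_weigh(v2)
name_sig1, _ = sign_and_weigh(v1[0])
name_sig2, _ = sign_and_weigh(v2[0])
```

Identical subtrees (the two `name` elements) receive identical signatures and can be matched directly, while the roots differ because one descendant changed. The weights indicate which subtrees would be considered first when matching starts from the heaviest nodes.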

The above memory-based approaches have the following limitations. First, they require the entire trees (i.e., DOM trees) of both XML documents to be memory resident. This problem is exacerbated by the fact that these trees are typically much larger than their XML documents [9]. Thus, the scheme is not scalable for very large XML documents. Second, the scheme is inefficient: an XML document must be parsed whenever we want to compare it with a new version. That is, if a document is compared with more than one document at different times, it has to be parsed multiple times.

There has been substantial research effort in storing and processing XML data. The relational storage approach has attracted considerable interest with a view to leveraging the powerful and reliable data management services of relational databases. The above limitations, coupled with the recent success in storing XML data in relational databases [6], [7], [12], [13], [17], lead us to ask whether we can address these problems by using relational techniques to detect the changes to XML documents. A relational database can be used in two ways to address the change detection problem. Suppose source A sends an XML document D1 (version 1) at time t1 to source B. B stores D1 in its local RDBMS. At time t2, A modifies D1 to D2 (version 2) and sends it to B. B can now detect the changes to the document in the following two ways.

  • (a) B extracts D1 from the relational database and compares it to D2 (before inserting D2 into the database) by using any one of the above memory-based change detection approaches.

  • (b) B first stores D2 in the relational database and then detects the changes to the documents by executing a set of SQL queries whenever appropriate.

In the first approach, the costs incurred are the extraction time of D1 and the change detection time of the memory-based algorithm. However, as mentioned earlier, these algorithms are not scalable. Furthermore, the extraction cost is incurred every time we wish to compare D1. The costs incurred by the second approach are the time taken to insert D2 into the database and the change detection time in the database. The second approach has several advantages. First, by storing XML documents as tables, we can filter out tuples and attributes that are not needed. Second, a system using this approach is more scalable, as it can handle very large XML documents that may not fit into main memory. Third, by storing XML in an RDBMS, we need to parse each XML document only once, and can then find the changes by issuing SQL queries against the database. Finally, implementing a change detection algorithm in SQL makes the programming task easier. Also, as SQL is an industry standard available on all major RDBMSs, the implementation of the change detection technique is portable.
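The second approach can be sketched in a few lines with Python's built-in `sqlite3` module. This is only an illustration of the general idea of parsing once, storing leaves as tuples, and comparing versions in SQL. The `leaf(doc, path, pos, value)` table is a hypothetical toy schema, not the schema used by Xandy, and the query only finds new (path, value) pairs without separating insertions from updates.

```python
import sqlite3
import xml.etree.ElementTree as ET


def shred(conn, doc_id, xml_text):
    """Parse once and store every leaf element as a (doc, path, pos, value) tuple."""
    rows = []

    def walk(elem, path):
        path = f"{path}/{elem.tag}"
        children = list(elem)
        if not children:
            # pos is the rank of the leaf in document order
            rows.append((doc_id, path, len(rows), (elem.text or "").strip()))
        for child in children:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    conn.executemany("INSERT INTO leaf VALUES (?, ?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leaf (doc INTEGER, path TEXT, pos INTEGER, value TEXT)")
shred(conn, 1, "<p><name>Smith</name><interest>XML</interest></p>")
shred(conn, 2, "<p><name>Smith</name><interest>XML</interest>"
               "<interest>Information Retrieval</interest></p>")

# Leaves of version 2 with no identical (path, value) leaf in version 1.
inserted = conn.execute(
    """
    SELECT n.path, n.value
    FROM leaf AS n
    WHERE n.doc = 2
      AND NOT EXISTS (SELECT 1 FROM leaf AS o
                      WHERE o.doc = 1 AND o.path = n.path AND o.value = n.value)
    ORDER BY n.pos
    """
).fetchall()
```

Once both versions are stored, neither document needs to be parsed again or held in memory: the comparison is delegated to the database engine, which can use indexes and disk-based joins for documents that exceed main memory.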

As the relational storage approach for storing and managing XML data has gained popularity, we believe that the second approach is an attractive option if it can address the following two issues. First, the insertion and extraction times for D1 and D2 should be comparable. In other words, the underlying relational storage structure must support efficient insertion and extraction of XML documents. Second, we must be able to detect all types of changes accurately.

In our preliminary efforts in [3], [8], we have demonstrated that it is indeed possible to use a relational database to detect the changes to ordered XML documents. However, the approaches in [3], [8] focused on content changes only and did not detect structural changes. The underlying relational schema of DiffXML [3] is simplistic and is not efficient for processing path-expression queries. Hence, our approach in [8] uses the SUCXENT schema, which enables us to insert, extract, and query XML data efficiently [12].

In this article, we present a novel relational-based approach called Xandy (Xml enAbled chaNge Detection sYstem) for detecting both content and structural changes to ordered XML documents. Given T1 and T2 as the old and new versions of an XML document respectively, we first store both documents in the relational database. Once the documents are stored, we detect the changes between T1 and T2 in two phases:

  • (1) Find the best matching subtrees. The objective of this phase is to find the most similar subtrees in T1 and T2. In this phase, we try to match the subtrees in T1 to the ones in T2. Some of the subtrees in T1 can be matched to more than one subtree in T2 and vice versa. We measure the similarity of each pair of matching subtrees by calculating a similarity score. The most similar subtrees are called best matching subtrees. We consider two approaches to this phase. The top-down approach starts computing the similarity scores from the root nodes of T1 and T2 and moves downward. The bottom-up approach starts by matching the root nodes of the subtrees rooted at the lowest level and moves upward. We shall see that the bottom-up approach is, on average, five times faster than the top-down approach, and that its result quality is also better.

  • (2) Detect the changes. In this phase, we use the information on best matching subtrees in order to detect the types of changes as discussed above by issuing SQL queries. First, we determine the changes on internal nodes (both insertions and deletions). Next, the inserted and deleted leaf nodes are detected. Finally, we detect updated leaf nodes and moved nodes. The XDeltas are stored in the relational tables.
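The leaf-level part of the second phase can be sketched as follows, assuming that the first phase has already recorded which parent subtrees match. Everything here is hypothetical: the tables `leaf` and `best_match`, the views `del_cand` and `ins_cand`, the node identifiers, and the rule that a deleted and an inserted leaf with the same name under matched parents form an update are assumptions made for illustration, not the queries used by Xandy. Internal nodes and moves are not covered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE leaf (doc INTEGER, node_id INTEGER, parent_id INTEGER,
                   name TEXT, value TEXT);
CREATE TABLE best_match (parent1 INTEGER, parent2 INTEGER);

-- Invented toy data: one matched subtree in version 1 (doc = 1) and 2 (doc = 2).
INSERT INTO leaf VALUES
  (1, 4, 3, 'name', 'Smith'),
  (1, 5, 3, 'city', 'Jakarta'),
  (1, 6, 3, 'fax', '555'),
  (2, 104, 103, 'name', 'Smith'),
  (2, 105, 103, 'city', 'Singapore'),
  (2, 106, 103, 'interest', 'XML');
INSERT INTO best_match VALUES (3, 103);

-- Old leaves with no identical (name, value) leaf in the matched new subtree.
CREATE VIEW del_cand AS
  SELECT o.node_id, o.name, o.value, m.parent2 AS parent2
  FROM leaf o JOIN best_match m ON m.parent1 = o.parent_id
  WHERE o.doc = 1 AND NOT EXISTS (
    SELECT 1 FROM leaf n
    WHERE n.doc = 2 AND n.parent_id = m.parent2
      AND n.name = o.name AND n.value = o.value);

-- New leaves with no identical (name, value) leaf in the matched old subtree.
CREATE VIEW ins_cand AS
  SELECT n.node_id, n.name, n.value, n.parent_id AS parent2
  FROM leaf n JOIN best_match m ON m.parent2 = n.parent_id
  WHERE n.doc = 2 AND NOT EXISTS (
    SELECT 1 FROM leaf o
    WHERE o.doc = 1 AND o.parent_id = m.parent1
      AND o.name = n.name AND o.value = n.value);
""")

# A deleted and an inserted candidate with the same name are paired as an update.
updated = conn.execute("""
    SELECT d.node_id, i.node_id, d.value, i.value
    FROM del_cand d JOIN ins_cand i
      ON i.parent2 = d.parent2 AND i.name = d.name
    ORDER BY d.node_id""").fetchall()

# Candidates left unpaired are reported as plain deletions and insertions.
deleted = conn.execute("""
    SELECT d.node_id FROM del_cand d
    WHERE NOT EXISTS (SELECT 1 FROM ins_cand i
                      WHERE i.parent2 = d.parent2 AND i.name = d.name)""").fetchall()
inserted = conn.execute("""
    SELECT i.node_id FROM ins_cand i
    WHERE NOT EXISTS (SELECT 1 FROM del_cand d
                      WHERE d.parent2 = i.parent2 AND d.name = i.name)""").fetchall()
```

In this toy data the `city` leaf is reported as an update, `fax` as a deletion, and `interest` as an insertion. With several same-named leaves under one parent the simple pairing above becomes ambiguous, which is one reason a real system needs a more careful matching step before classifying changes.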

In summary, this article makes the following contributions:

  • We propose a novel technique to detect the changes, both structural and content changes, to the ordered XML documents by using relational databases. The relational-based approach is able to overcome the scalability problem that occurs on the memory-based approach.

  • By extending a published relational schema called SUCXENT [12], Xandy is efficient not only for detecting the changes, but also for inserting, extracting, and querying XML data, as it inherits the features of SUCXENT. In [12], the authors have shown that the execution times for inserting and extracting XML documents using the SUCXENT schema are comparable.

  • An extensive performance study was conducted on our approaches. The experimental results show that the relational-based approach is more scalable than the memory-based approaches.

The organization of the rest of this article is as follows. In Section 2 we shall briefly discuss the relational schema that we use for storing the XML documents. In Section 3, we discuss how we are able to find the best matching subtrees from two given versions of an XML document. We present the algorithms for the top-down approach and the bottom-up approach. We shall elaborate how the XDeltas can be discovered in Section 4. We also present the SQL queries that are used to discover the XDeltas. In Section 5, we compare the performance of different approaches. Finally, we conclude the article in the last section.

Section snippets

Background

There are two approaches for storing XML documents in a relational database: the model-mapping approaches [6], [7], [12], [17] and the structure-mapping approaches [13]. The model-mapping approaches maintain a fixed schema which is used to store XML documents irrespective of their schemas. The structure-mapping approaches first create a relational schema based on the schemas of the XML documents. In this article, we adopt the model-mapping approach for the following reasons. First, the DTD or

Finding best matching subtrees

In this section, we shall elaborate on how to find the best matching subtrees. The objective of finding the best matching subtrees is to enable us to obtain the minimum XML delta. The minimum XML delta is defined as the delta which has the least number of edit operations (types of changes).

Suppose we have two XML trees, T1 and T2, as depicted in Fig. 1. More than one XDelta can be detected from T1 and T2. For example, we may have an XDelta that contains seven updates and a

Detecting the changes

After we are able to identify best matching subtrees in T1 and T2, we are ready to detect the changes between T1 and T2 by using the information on the best matching subtrees. There are seven types of changes considered in this article: insertion of internal nodes, insertion of leaf nodes, deletion of internal nodes, deletion of leaf nodes, content update of leaf nodes, move among siblings, and move to different parent nodes. In this section, we shall discuss the concepts that we shall use in

Experimental results

In this section, we examine the performance of Xandy approaches. The top-down and bottom-up approaches are implemented in Java. We ran the experiments on a Microsoft Windows 2000 Professional machine having Intel Pentium 4 1.7 GHz processor with 512 MB of memory. The database system we used was IBM DB2 UDB 8.1. We create two databases, one is for the top-down approach, and another is for the bottom-up approach. We specify the query workload to the Design Advisor, and the indexes on the

Conclusions

The relational-based approach for ordered XML change detection system in this article is motivated by the scalability problem of existing memory-based approaches. We have shown that the relational approach is able to handle XML documents that are much larger than the ones detected by using main-memory approaches. We also report on the performance of two relational approaches in Xandy, the top-down and the bottom-up approaches, on two different kinds of data sets, the data-centric and the

Erwin Leonardi is a Ph.D. student in Computer Engineering at Nanyang Technological University, Singapore. He received his B.Sc. in Computer Science from Bina Nusantara University, Jakarta, Indonesia. His research interests include change management and XML data management.

References (17)

  • S. Chawathe, H. Garcia-Molina, Meaningful change detection in structured data, in: Proceedings of the ACM SIGMOD,...
  • S. Chawathe, A. Rajaraman, H. Garcia-Molina, J. Widom, Change detection in hierarchically structured information, in:...
  • Y. Chen, S. Madria, S.S. Bhowmick, DiffXML: Change detection in XML data, in: Proceedings of the 9th International...
  • G. Cobena, S. Abiteboul, A. Marian, Detecting changes in XML documents, in: Proceedings of the 18th International...
  • Curbera, D.A. Epstein, Fast difference and update of XML Documents, XTech’99, San Jose,...
  • D. Florescu et al., Storing and Querying XML Data Using an RDMBS, IEEE Data Engineering Bulletin (1999)
  • H. Jiang, H. Lu, W. Wang, J. Xu Yu. Path materialization revisited: an efficient storage model for XML data, in:...
  • E. Leonardi, S.S. Bhowmick, T.S. Dharma, S. Madria, Detecting content changes on ordered XML documents using relational...
There are more references available in the full text version of this article.

Sourav S. Bhowmick received his Ph.D. in Computer Engineering in 2001. He is currently an Assistant Professor in the School of Computer Engineering, Nanyang Technological University. His current research interests include XML data management, data integration, web mining, and biological data management. He has published more than 80 papers in major international database conferences and journals such as VLDB, IEEE ICDE, ACM CIKM, ICDCS, DEXA, IEEE Transactions on Knowledge and Data Engineering, ACM Computing Surveys, Information Systems, and Data and Knowledge Engineering Journal. He serves as a PC member of various database conferences and workshops and as a reviewer for various database journals. He has been the program chair of the International Workshop on Biological Data Management (BIDM) since 2003. He is the Guest Editor of a Special Issue on Biological Data Management for the Data and Knowledge Journal. He also serves on the editorial boards of the International Journal of Digital Information Management (JDIM) and the International Journal of Data Warehousing and Mining (JDWM). He has co-authored a book entitled “Web Data Management: A Warehouse Approach” (Springer-Verlag, October 2003). He is a member of ACM and IEEE.
