Elsevier

Information Sciences

Volume 210, 25 November 2012, Pages 41-54

XML filtering with XPath expressions containing parent and ancestor axes

https://doi.org/10.1016/j.ins.2012.04.035

Abstract

More and more XML data is generated and used for data exchange. In this paper, we address the problem of filtering XML documents with a large number of XPath expressions, which may contain ‘ancestor’ and ‘parent’ axes. XPath expressions with these axes give users a more powerful and flexible way to describe their interests in publish/subscribe systems. First, we analyze the characteristics of the ‘parent’ axis and propose a series of rules to eliminate it from XPath expressions. Then we propose a new index structure called NIndex, which is designed to efficiently store and index a large number of XPath expressions. NIndex offers several features that make it especially attractive for large-scale selective dissemination of information, including the ability to handle complex XPath expressions with ‘ancestor’ and ‘parent’ axes, and efficient pruning. Based on NIndex, we design a new filtering algorithm with low complexity for our problem. Our experimental results show that our algorithm performs well across a range of XPath expressions and documents.

Introduction

The proliferation of the Internet and the exploding volume of information available on it have fueled the development of a wide range of new applications based on selective dissemination of information (SDI) [3]. These applications include stock exchange, advertisement systems, electronic personalized newspapers, online shopping, online auctioning and entertainment delivery, and require timely distribution of data to a large set of customers. In an SDI (or publish/subscribe) system, users subscribe to a data server with continuous queries or profiles, expressed in a well-defined language, that describe their information needs. The SDI system performs the matching task and ensures timely delivery of published data to all interested subscribers.

With XML becoming the standard of data representation and exchange on the Internet, effective and efficient methods have been studied for searching useful information from ordinary and probabilistic XML documents by both structured queries and keyword queries [2], [7], [25], [26], [28], [30], [33], [42]. XML is also adopted for content-based publish/subscribe systems because published XML messages have flexible document structures and subscription rules can be expressed by a powerful language such as XPath [9] and XQuery [5]. In an XML publish/subscribe system, XML filtering is a core part, in which continuously arriving XML documents are routed to users according to their subscriptions expressed as queries. These queries typically specify patterns of selection on multiple elements.

Tremendous effort has been put into efficiently filtering XML streams with a collection of XPath expressions [3], [8], [12], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [32], [36], [41], including XFilter [3], YFilter [12], XTrie [8], XPush [18] and the lazy filtering algorithm (LF) [16]. Most of these systems only support forward axes such as child and descendant axes.

However, in real applications, XML queries submitted by users may contain predicates with ancestor and parent axes. Firstly, documents for dissemination usually have different schemas [1], [29], [39]; consequently, the ancestor tags of a certain tag t in a query may be related to one another differently in different documents. Sometimes, users are unconcerned about the exact relationships between those ancestor tags as long as they are ancestors of tag t. Moreover, users may not always be aware of the schemas of XML documents. For these reasons, users have to use predicates with ancestor or parent axes to constrain the answers.

For example, Alice wants to buy some books from a book selling website. She is interested in computer books authored by “John” and published by “Addison-Wesley”. The documents (a) and (b) shown in Fig. 1 both satisfy Alice’s request, and therefore both of them should be disseminated to Alice. This can easily be realized by submitting the following query Q with an ancestor axis:

Q: //Computer//book[ancestor::Addison-Wesley]//author[name=John]

However, if Alice does not use the ancestor axis and instead submits the following query Qa:

Qa: //Computer//Addison-Wesley//book//author[name=John]

then only document (a) will be returned, because the forward axes in Qa can express only one kind of relationship between the ancestor tags “Computer” and “Addison-Wesley” of the book tag, i.e. “Computer” is the ancestor of “Addison-Wesley”. Document (b) will not be returned because the ancestor tags “Computer” and “Addison-Wesley” have a different relationship, i.e. “Addison-Wesley” is the ancestor of “Computer”. From the example, we can clearly see that using the ancestor axis provides more expressive power than not using it. To improve flexibility, we also support the parent axis.
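
The difference between Q and Qa can be checked on two small documents. The sketch below is only an illustration under stated assumptions: Fig. 1 is not reproduced here, so `DOC_A` and `DOC_B` are assumed stand-ins built from the description above, and `matches_q` and `matches_qa` are hypothetical helpers that hard-code the two queries by hand rather than a general XPath engine.

```python
import xml.etree.ElementTree as ET

# Assumed stand-ins for documents (a) and (b) of Fig. 1.
DOC_A = """<Computer><Addison-Wesley><book>
  <author><name>John</name></author>
</book></Addison-Wesley></Computer>"""
DOC_B = """<Addison-Wesley><Computer><book>
  <author><name>John</name></author>
</book></Computer></Addison-Wesley>"""


def has_john(book):
    # Hand-coded //author[name=John] evaluated below a book element.
    return any(n.text == "John"
               for a in book.iter("author") for n in a.findall("name"))


def book_ancestor_lists(xml_text):
    """Yield the root-to-parent tag list of every qualifying book element."""
    def walk(node, ancestors):
        if node.tag == "book" and has_john(node):
            yield ancestors
        for child in node:
            yield from walk(child, ancestors + [node.tag])
    yield from walk(ET.fromstring(xml_text), [])


def matches_q(xml_text):
    # Q: both tags must be ancestors of the book, in any order.
    return any("Computer" in anc and "Addison-Wesley" in anc
               for anc in book_ancestor_lists(xml_text))


def matches_qa(xml_text):
    # Qa: "Computer" must appear above "Addison-Wesley" on the path.
    return any("Computer" in anc
               and "Addison-Wesley" in anc[anc.index("Computer") + 1:]
               for anc in book_ancestor_lists(xml_text))
```

Under these assumptions, `matches_q` accepts both documents, while `matches_qa` accepts only `DOC_A`, which mirrors the behavior described for Q and Qa.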

In this paper, we propose a new index structure named NIndex, which supports efficient filtering of XML documents based on XPath expressions (XPEs) which contain ancestor and parent axes. NIndex structure offers several novel features that make it especially attractive for large-scale publish/subscribe systems.

Firstly, we analyze the characteristics of the ‘parent’ axis and propose a series of rules to eliminate it from XPEs that contain it. Olteanu et al. proposed rewriting rules [34] to transform absolute XPath location paths with reverse axes into equivalent reverse-axis-free paths, and to rewrite all axes to a minimum set of forward axes. However, the transformed XPEs contain join operations, which are not convenient to handle in an XML document filtering system. In this paper, we only consider transformation of the ‘parent’ axis, and the transformed XPEs contain only the simple axes ‘ancestor’, ‘descendant’ and ‘child’.

Secondly, to improve filtering performance, our NIndex scheme captures common sub-expressions among multiple XPEs. XTrie [8] indexes a set of substrings by using a substring table in which the levels of elements are recorded. However, the level attribute may cause redundancy even when sub-patterns are identical: different occurrences of the same sub-pattern are treated as different items in the substring table if they appear at different levels. Our index therefore represents XPEs more concisely. It reduces the number of unnecessary index probes and avoids redundant matching.

Thirdly, NIndex is designed to support effective filtering of XML documents based on complex XPEs containing ancestor and parent axes. Xaos [4] is a query processing system that also supports ancestor and parent axes, and it uses the X-dag structure to express queries. However, the X-dag structure can handle only a single XPE and provides no support for multiple XPEs.

Finally, the experiments show that our algorithm is time-efficient.

The remainder of this paper is organized as follows. In Section 2, we introduce the background and give the definition of a PXPE-tree. Section 3 discusses how to eliminate the parent axis in a PXPE-tree. In Section 4, we introduce the structure of NIndex, which is the index for XPEs, and design updating algorithms for maintaining NIndex. In Section 5, we propose a filtering algorithm, based on NIndex, to address the XPE retrieval problem. Section 6 shows our experimental results. Section 7 discusses related works and compares them with our work. Section 8 concludes the paper.

Section snippets

Data model for XML streams

An XML document can be modeled as a rooted, labeled, and ordered tree, which we call an XML data tree. Each node in the data tree corresponds to an element, attribute, or text value in the XML document. An XML streaming algorithm accepts input XML documents as a stream of SAX [6] events. Two core SAX events are startElement(qname) and endElement(qname), which are activated, respectively, when the opening or closing tag of a streaming element arrives, and accept the name of that element, qname
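
As a minimal sketch of this event model, the following uses Python's standard `xml.sax` module. The class `PathTracker` and the function `element_paths` are hypothetical names introduced for illustration; they are not part of the paper. The handler keeps a stack of currently open elements, so the root-to-element path is available whenever a startElement event fires.

```python
import xml.sax


class PathTracker(xml.sax.ContentHandler):
    """Records the root-to-element path seen at each startElement event."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open element names
        self.paths = []      # one "/a/b/c" string per startElement event

    def startElement(self, qname, attrs):
        # Fired when an opening tag arrives: push, then record the path.
        self.open_tags.append(qname)
        self.paths.append("/" + "/".join(self.open_tags))

    def endElement(self, qname):
        # Fired when the closing tag arrives: the element is no longer open.
        self.open_tags.pop()


def element_paths(xml_text):
    handler = PathTracker()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.paths
```

For the document `<a><b/><c><b/></c></a>`, `element_paths` returns `["/a", "/a/b", "/a/c", "/a/c/b"]`. Because the stack always holds the ancestors of the current element, a streaming algorithm can test ancestor conditions without building the whole data tree.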

Elimination of ‘parent’ axis

The ‘parent’ axis specifies the child-parent relationship between two PXPE-nodes. In fact, the child-parent relationship can be expressed by the parent–child relationship using the ‘child’ axis. Therefore, the filtering algorithm does not need a dedicated part for the ‘parent’ axis; we simply transform the PXPE-trees with ‘parent’ axes into parent-axis-free XPEs. In this section, we propose a series of lemmas to eliminate the ‘parent’ axis in a PXPE-tree.
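
The lemmas themselves are not included in this excerpt. As a hedged illustration of the underlying idea only, the sketch below checks one standard XPath equivalence, `//b[parent::a]` versus `//a/b`, on a small document. The tag names `a`, `b`, `c`, `r`, the sample document `DOC`, and both helper functions are assumptions made for this example and are not taken from the paper.

```python
import xml.etree.ElementTree as ET


def b_with_parent_a(root):
    """Evaluate //b[parent::a] by hand: look upward from each b element."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    return [b for b in root.iter("b")
            if b in parent_of and parent_of[b].tag == "a"]


def b_child_of_a(root):
    """Evaluate the parent-axis-free form //a/b: look downward from each a."""
    return [b for a in root.iter("a") for b in a.findall("b")]


# Sample document: only two of the four b elements have an a parent.
DOC = ET.fromstring("<r><a><b/><c><b/></c></a><b/><a><a><b/></a></a></r>")

upward = {id(e) for e in b_with_parent_a(DOC)}
downward = {id(e) for e in b_child_of_a(DOC)}
```

Both evaluations select the same two `b` elements in this document, which is why a condition stated with the ‘parent’ axis can be restated with the ‘child’ axis.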

NIndex: indexing XPath expressions

In this section, we explain how to build NIndex, the index of XPEs. By Theorem 1, we can transform valid queries with the ‘parent’ axis into queries without the ‘parent’ axis. Therefore, the queries indexed by NIndex never contain the ‘parent’ axis. In the following, we first introduce the structure of NIndex, then discuss how to update NIndex.

Filtering algorithm based on NIndex

Based on NIndex, we propose a filtering algorithm that checks, for a given document D and a collection of PXPE-trees, which PXPE-trees match D. As discussed above, each PXPE-tree T rooted at node r is stored as a row ci in the table Tn (here n is the tag name of r) in NIndex. The fields p_dc and p_a of ci record all child nodes of r with ‘descendant’/‘child’ and ‘ancestor’ axes, respectively. To process ‘descendant’/‘child’ axes, a stack will be created for ci when the start tag of an element e with tag

XML documents

NITF (News Industry Text Format) DTD [11] and IBM’s XML Generator tool [13] were used to generate our XML document data set. NITF DTD contains 123 elements with 513 attributes. We generated a set of XML documents with the number of elements varying from 1000 to 8000 and the size varying from 9 K to 900 K.

XPath expressions

We implemented an XPath expression generator that takes a DTD as input and creates a set of valid XPath expressions based on the following five input parameters. The parameter P controls the size

Related works and comparisons

For the XPath query evaluation problem, most existing works focus on structural joins and indexing. Al-Khalifa proposed algorithms Tree-Merge join and Stack-Tree join in [2], and Bruno proposed algorithm Path-Stack [7]. Several indexing schemes were proposed for XPath query processing including LORE [31], dataguides [14], ToXin [35], XISS [27], index fabric [10], and ViST [38]. The core problem of these works is to efficiently match the descendant axis in XML documents. The ‘ancestor’ and

Conclusions

In this paper, we addressed the problem of filtering XML documents with a large number of user-submitted queries expressed as XPath expressions. We proposed a new index structure called NIndex. It offers several features that make it especially attractive for large-scale SDI systems. NIndex is designed to support effective filtering based on complex XPath expressions containing the reverse axes ‘ancestor’ and ‘parent’. It reduces the number of unnecessary index probes and avoids redundant matching.

Acknowledgements

This research was supported by the Australian Research Council Discovery Projects (Grant Nos. DP0878405 and DP110102407), the National Natural Science Foundation of China (Grant Nos. 60972090 and 61073057), and the Fundamental Research Funds for the Central Universities of China (Grant No. 2011ZD010).

References (42)

  • D. Brownell, D. Megginson. SAX: Simple API for XML. SAX Project Organization, 2009....
  • N. Bruno, N. Koudas, D. Srivastava, Holistic twig joins: optimal XML pattern matching, in: Proceedings of SIGMOD, 2002,...
  • C.Y. Chan, P. Felber, M.N. Garofalakis, R. Rastogi, Efficient filtering of XML documents with XPath expressions, in:...
  • J. Clark, S. DeRose, XML Path Language (XPath) Version 1.0. W3C, 1999....
  • B. Cooper, N. Sample, M. Franklin, G.R. Hjaltason, M. Shadmon, A fast index for semistructured data, in: Proceedings of...
  • R. Cover, The SGML/XML Web Page, 1999....
  • Y. Diao, P. Fischer, M. Franklin, R. To, Yfilter: efficient and scalable filtering of XML documents, in: Proceeding of...
  • A. Diaz, D. Lovell, XML Generator, 1999....
  • R. Goldman, J. Widom, Data guides, enabling query formulation and optimization in semistructured database, in:...
  • X.Q. Gong, W.N. Qian, Y. Yan, A.Y. Zhou, Bloom filter-based XML packets filtering for millions of path queries, in:...
  • G. Gou, R. Chirkova, Efficient algorithms for evaluating XPath over streams, in: Proceeding of SIGMOD, 2007, pp....