XML filtering with XPath expressions containing parent and ancestor axes

doi:10.1016/j.ins.2012.04.035

Information Sciences

Volume 210, 25 November 2012, Pages 41-54

https://doi.org/10.1016/j.ins.2012.04.035

Abstract

More and more XML data is generated and used for data exchange. In this paper, we address the problem of filtering XML documents with a large number of XPath expressions, which may contain ‘ancestor’ and ‘parent’ axes. XPath expressions with these axes give users a more powerful and flexible way to describe their interests in publish/subscribe systems. First, we analyze the characteristics of the ‘parent’ axis and propose a series of rules to eliminate it from XPath expressions. Then we propose a new index structure called NIndex, which is designed to efficiently store and index a large number of XPath expressions. NIndex offers several features that make it especially attractive for large-scale selective dissemination of information, including the ability to handle complex XPath expressions with ‘ancestor’ and ‘parent’ axes, and efficient pruning. Based on NIndex, we design a new filtering algorithm with low complexity for our problem. Our experimental results show that our algorithm performs well across a range of XPath expressions and documents.

Introduction

The proliferation of the Internet and the exploding volume of information available on it have fueled the development of a wide range of new applications based on selective dissemination of information (SDI) [3]. These applications include stock exchange, advertisement systems, electronic personalized newspapers, online shopping, online auctioning and entertainment delivery, and they require timely distribution of data to a large set of customers. In an SDI (or publish/subscribe) system, users subscribe to a data server with continuous queries or profiles that express their information needs in a well-defined language. The SDI system performs the matching task and ensures timely delivery of published data to all interested subscribers.

With XML becoming the standard for data representation and exchange on the Internet, effective and efficient methods have been studied for searching for useful information in ordinary and probabilistic XML documents by both structured queries and keyword queries [2], [7], [25], [26], [28], [30], [33], [42]. XML is also adopted for content-based publish/subscribe systems because published XML messages have flexible document structures and subscription rules can be expressed in a powerful language such as XPath [9] or XQuery [5]. In an XML publish/subscribe system, XML filtering is a core part, in which continuously arriving XML documents are routed to users according to their subscriptions expressed as queries. These queries typically specify patterns of selection on multiple elements.

Tremendous effort has been put into efficiently filtering XML streams with a collection of XPath expressions [3], [8], [12], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [32], [36], [41], including XFilter [3], YFilter [12], XTrie [8], XPush [18] and the lazy filtering algorithm (LF) [16]. Most of these systems only support forward axes such as child and descendant axes.

However, in real applications, XML queries submitted by users may contain predicates with ancestor and parent axes. Firstly, documents for dissemination usually have different schemas [1], [29], [39]. Consequently, the ancestor tags of a certain tag t in a query may be related to one another differently in different documents. Sometimes, users are unconcerned about the exact relationships between those ancestor tags as long as they are ancestors of tag t. Moreover, users may not always be aware of the schemas of XML documents. For these reasons, users have to use predicates with ancestor or parent axes to constrain the answers.

For example, Alice wants to buy some books from a book selling website. She is interested in computer books authored by “John” and published by “Addison-Wesley”. The documents (a) and (b) shown in Fig. 1 both satisfy Alice’s request, and therefore both of them should be disseminated to Alice. This can easily be realized by submitting the following query Q with an ancestor axis: $Q : \texttt{//Computer//book[ancestor::Addison-Wesley]//author[name="John"]}$ However, if Alice does not use the ancestor axis and instead submits the following query Q_a: $Q_{a} : \texttt{//Computer//Addison-Wesley//book//author[name="John"]}$ then only document (a) will be returned, because the forward axes in Q_a can express only one kind of relationship between the ancestor tags “Computer” and “Addison-Wesley” of the book tag, i.e. “Computer” is the ancestor of “Addison-Wesley”. Document (b) will not be returned because there the ancestor tags “Computer” and “Addison-Wesley” have the opposite relationship, i.e. “Addison-Wesley” is the ancestor of “Computer”. The example shows that the ancestor axis provides more expressive power than forward axes alone. To improve flexibility, we also support the parent axis.
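The difference between Q and Q_a can be checked on a small scale with the following Python sketch. It is only an illustration: Fig. 1 is not reproduced here, so `DOC_A` and `DOC_B` are assumed stand-ins for documents (a) and (b), and the helper names (`ancestors`, `matches_q`, `matches_qa`) are hypothetical. The two queries are evaluated by hand over a parsed tree because the standard-library `xml.etree.ElementTree` module does not implement the ancestor axis.

```python
import xml.etree.ElementTree as ET

# Assumed stand-ins for documents (a) and (b) of Fig. 1 (not the real figure).
DOC_A = ("<Computer><Addison-Wesley><book><author><name>John</name></author>"
         "</book></Addison-Wesley></Computer>")
DOC_B = ("<Addison-Wesley><Computer><book><author><name>John</name></author>"
         "</book></Computer></Addison-Wesley>")


def ancestors(elem, parent_of):
    """Tag names of all proper ancestors of elem, nearest ancestor first."""
    tags = []
    while elem in parent_of:
        elem = parent_of[elem]
        tags.append(elem.tag)
    return tags


def has_john_author(book):
    """True if //author[name="John"] has a match below the given book."""
    return any(
        any(name.text == "John" for name in author.findall("name"))
        for author in book.iter("author")
    )


def matches_q(xml_text):
    """Q: //Computer//book[ancestor::Addison-Wesley]//author[name="John"]"""
    root = ET.fromstring(xml_text)
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for book in root.iter("book"):
        anc = ancestors(book, parent_of)
        # Both tags must be ancestors of book; their relative order is free.
        if "Computer" in anc and "Addison-Wesley" in anc and has_john_author(book):
            return True
    return False


def matches_qa(xml_text):
    """Q_a: //Computer//Addison-Wesley//book//author[name="John"]"""
    root = ET.fromstring(xml_text)
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for book in root.iter("book"):
        anc = ancestors(book, parent_of)
        for i, tag in enumerate(anc):
            # Addison-Wesley must itself lie below some Computer element.
            if tag == "Addison-Wesley" and "Computer" in anc[i + 1:]:
                if has_john_author(book):
                    return True
    return False
```

Under these assumed documents, `matches_q` holds for both `DOC_A` and `DOC_B`, whereas `matches_qa` holds only for `DOC_A`, which mirrors the behaviour described above.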

In this paper, we propose a new index structure named NIndex, which supports efficient filtering of XML documents based on XPath expressions (XPEs) which contain ancestor and parent axes. NIndex structure offers several novel features that make it especially attractive for large-scale publish/subscribe systems.

Firstly, we analyze the characteristics of the ‘parent’ axis and propose a series of rules to eliminate it from XPEs that contain it. Olteanu et al. proposed rewriting rules [34] to transform absolute XPath location paths with reverse axes into equivalent reverse-axis-free paths, and to rewrite all axes to a minimum set of forward axes. However, the transformed XPEs contain join operations, which are not convenient to handle in an XML document filtering system. In this paper, we only consider transformation of the ‘parent’ axis, and the transformed XPEs contain only the simple axes ‘ancestor’, ‘descendant’ and ‘child’.

Secondly, to improve filtering performance, our NIndex scheme captures common sub-expressions among multiple XPEs. XTrie [8] indexes a set of substrings by using a substring table in which the levels of elements are recorded. However, the level attribute may cause redundancy even when sub-patterns are identical: different occurrences of the same sub-pattern are treated as different items in the substring table if they appear at different levels. In contrast, our index represents XPEs more concisely. It reduces the number of unnecessary index probes and avoids redundant matchings.

Thirdly, NIndex is designed to support effective filtering of XML documents based on complex XPEs containing ancestor and parent axes. Xaos [4] is a query processing system that also supports ancestor and parent axes, and it uses the X-dag structure to express queries. However, the X-dag structure can handle only a single XPE and provides no support for multiple XPEs.

Finally, the experiments show that our algorithm is time-efficient.

The remainder of this paper is organized as follows. In Section 2, we introduce the background and give the definition of a PXPE-tree. Section 3 discusses how to eliminate the parent axis in a PXPE-tree. In Section 4, we introduce the structure of NIndex, which is the index for XPEs, and design updating algorithms for maintaining NIndex. In Section 5, we propose a filtering algorithm, based on NIndex, to address the XPE retrieval problem. Section 6 shows our experimental results. Section 7 discusses related works and compares them with our work. Section 8 concludes the paper.

Section snippets

Data model for XML streams

An XML document can be modeled as a rooted, labeled, and ordered tree, which we call an XML data tree. Each node in the data tree corresponds to an element, attribute, or text value in the XML document. An XML streaming algorithm accepts input XML documents as a stream of SAX [6] events. Two core SAX events are startElement(qname) and endElement(qname), which are activated, respectively, when the opening or closing tag of a streaming element arrives, and accept the name of that element, qname
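As a minimal illustration of this event model, the following Python sketch uses the standard-library `xml.sax` module to record the start and end events of a document together with the depth of each element in the data tree. The class and function names (`EventLogger`, `sax_events`) are hypothetical, and Python's handler methods receive the element name as `name` rather than `qname`.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records (event, element name, depth) for every start and end tag."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.depth = 0

    def startElement(self, name, attrs):
        # Fired when the opening tag of an element arrives.
        self.depth += 1
        self.events.append(("start", name, self.depth))

    def endElement(self, name):
        # Fired when the closing tag of an element arrives.
        self.events.append(("end", name, self.depth))
        self.depth -= 1


def sax_events(xml_text):
    """Parse xml_text once, front to back, and return the recorded events."""
    handler = EventLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events
```

For the input `<a><b/></a>`, the recorded sequence is start a, start b, end b, end a, so an element's start event always arrives while all of its ancestors are still open.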

Elimination of ‘parent’ axis

The ‘parent’ axis specifies the child–parent relationship between two PXPE-nodes. In fact, the child–parent relationship can be expressed as a parent–child relationship using the ‘child’ axis. Therefore, we do not need a dedicated part of the XML document filtering algorithm for the ‘parent’ axis; we simply transform the PXPE-trees with ‘parent’ axes into parent-axis-free XPEs. In this section, we propose a series of lemmas to eliminate the ‘parent’ axis in a PXPE-tree.
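The simplest instance of this idea is the standard XPath equivalence between `//b[parent::a]` and `//a/b`. The Python sketch below checks it on a small tree; it illustrates the general principle only and is not one of the lemmas of this section, which are not reproduced here. The function names and the sample document are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET


def select_with_parent_axis(root, child_tag, parent_tag):
    """Nodes selected by //child_tag[parent::parent_tag], evaluated by hand."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    return [
        elem for elem in root.iter(child_tag)
        if elem in parent_of and parent_of[elem].tag == parent_tag
    ]


def select_with_child_axis(root, child_tag, parent_tag):
    """Nodes selected by the parent-axis-free form //parent_tag/child_tag."""
    return [
        child
        for parent in root.iter(parent_tag)
        for child in parent
        if child.tag == child_tag
    ]


# Hypothetical sample tree: only the first <b> has <a> as its parent.
SAMPLE = ET.fromstring("<r><a><b/><c><b/></c></a><b/></r>")
```

Both functions return the same set of nodes for `SAMPLE` (a single `b` element), although the two lists may be ordered differently on trees with nested `a` elements.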

NIndex: indexing XPath expressions

In this section, we explain how to build NIndex, the index of XPEs. By Theorem 1, we can transform valid queries with the ‘parent’ axis into queries without the ‘parent’ axis. Therefore, the queries indexed by NIndex do not contain the ‘parent’ axis. In the following, we first introduce the structure of NIndex and then discuss how to update it.

Filtering algorithm based on NIndex

Based on NIndex, we propose a filtering algorithm that checks, for a given document D and a collection of PXPE-trees, which PXPE-tree matches D. As discussed above, each PXPE-tree T rooted at node r is stored as a row c_i in the table T_n (here n is the tag name of r) in NIndex. p_dc and p_a of c_i record all child nodes of r with ‘descendant’/‘child’ and ‘ancestor’ axes, respectively. To process ‘descendant’/‘child’ axes, a stack will be created for c_i when the start tag of an element e with tag
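One reason ‘ancestor’ constraints suit streaming evaluation is that, when the start tag of an element arrives, all of its ancestors are exactly the elements that are still open. The Python sketch below exploits this with a single stack of open tags. It is a deliberately simplified illustration, not the NIndex algorithm: it has no tables T_n, no p_dc or p_a entries and no per-row stacks, and the subscription format (a target tag plus a set of required ancestor tags) is an assumption made for this example.

```python
import xml.sax


class AncestorFilter(xml.sax.ContentHandler):
    """Streams a document once and records which subscriptions match.

    Each subscription is a hypothetical (target_tag, required_ancestors)
    pair standing for //target_tag[ancestor::A1][ancestor::A2]...
    """

    def __init__(self, subscriptions):
        super().__init__()
        self.subscriptions = subscriptions  # id -> (tag, set of ancestor tags)
        self.open_tags = []                 # stack of currently open elements
        self.matched = set()

    def startElement(self, name, attrs):
        # At this point open_tags holds exactly the ancestors of the new element.
        for sub_id, (tag, required) in self.subscriptions.items():
            if sub_id in self.matched:
                continue
            if name == tag and required.issubset(self.open_tags):
                self.matched.add(sub_id)
        self.open_tags.append(name)

    def endElement(self, name):
        self.open_tags.pop()


def filter_document(xml_text, subscriptions):
    """Return the ids of all subscriptions matched by the document."""
    handler = AncestorFilter(subscriptions)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.matched
```

A subscription such as `("book", {"Computer", "Addison-Wesley"})` matches a book nested under both tags in either order. This naive version scans every subscription on every start tag, which is the kind of cost that a shared index over the subscriptions is meant to reduce.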

XML documents

NITF (News Industry Text Format) DTD [11] and IBM’s XML Generator tool [13] were used to generate our XML document data set. NITF DTD contains 123 elements with 513 attributes. We generated a set of XML documents with the number of elements varying from 1000 to 8000 and the size varying from 9 K to 900 K.

XPath expressions

We implemented an XPath expression generator that takes a DTD as input and creates a set of valid XPath expressions based on the following five input parameters. The parameter P controls the size

Related works and comparisons

For the XPath query evaluation problem, most existing works focus on structural joins and indexing. Al-Khalifa proposed algorithms Tree-Merge join and Stack-Tree join in [2], and Bruno proposed algorithm Path-Stack [7]. Several indexing schemes were proposed for XPath query processing including LORE [31], dataguides [14], ToXin [35], XISS [27], index fabric [10], and ViST [38]. The core problem of these works is to efficiently match the descendant axis in XML documents. The ‘ancestor’ and

Conclusions

In this paper, we addressed the problem of filtering XML documents with a large number of user-submitted queries expressed as XPath expressions. We proposed a new index structure called NIndex. It offers several features that make it especially attractive for large-scale SDI systems. NIndex is designed to support effective filtering based on complex XPath expressions containing the reverse axes ‘ancestor’ and ‘parent’. It reduces the number of unnecessary index probes and avoids redundant matchings.

Acknowledgements

This research was supported by the Australian Research Council Discovery Projects (Grant Nos. DP0878405 and DP110102407), the National Natural Science Foundation of China (Grant Nos. 60972090 and 61073057), and the Fundamental Research Funds for the Central Universities of China (Grant No. 2011ZD010).

References (42)

A. Algergawy et al., Element similarity measures in XML schema matching, Inf. Sci. (2010)
H.H. Lee et al., Selectivity-sensitive shared evaluation of multiple continuous XPath queries over XML streams, Inf. Sci. (2009)
H.H. Lee et al., Attribute-based evaluation of multiple continuous queries for filtering incoming tuples of a data stream, Inf. Sci. (2008)
Z.M. Ma et al., Matching twigs in fuzzy XML, Inf. Sci. (2011)
J.K. Min et al., XTREAM: an efficient multi-query evaluation on streaming XML data, Inf. Sci. (2007)
A. Wojnar et al., Structural and semantic aspects of similarity of document type definitions and XML schemas, Inf. Sci. (2010)
S. Al-Khalifa, D. Srivastava, H.V. Jagadish, N. Koudas, J.M. Patel, Y. Wu, Structural joins, a primitive for efficient...
M. Altinel, M.J. Franklin, Efficient filtering of XML documents for selective dissemination of information, in:...
C. Barton, P. Charles, D. Goyal, M. Raghavachari, M. Fontoura, V. Josifovski, Streaming XPath processing with forward...
S. Boag, D. Chamberlin, M.F. Fernandez, D. Florescu, J. Robie, J. Simeon. XQuery 1.0: An XML Query Language, 1999....

D. Brownell, D. Megginson. SAX: Simple API for XML. SAX Project Organization, 2009....

N. Bruno, N. Koudas, D. Srivastava, Holistic twig joins: optimal XML pattern matching, in: Proceedings of SIGMOD, 2002,...

C.Y. Chan, P. Felber, M.N. Garofalakis, R. Rastogi, Efficient filtering of XML documents with XPath expressions, in:...

J. Clark, S. DeRose, XML Path Language (XPath) Version 1.0. W3C, 1999....

B. Cooper, N. Sample, M. Franklin, G.R. Hjaltason, M. Shadmon, A fast index for semistructured data, in: Proceedings of...

R. Cover, The SGML/XML Web Page, 1999....

Y. Diao, P. Fischer, M. Franklin, R. To, Yfilter: efficient and scalable filtering of XML documents, in: Proceeding of...

A. Diaz, D. Lovell, XML Generator, 1999....

R. Goldman, J. Widom, Data guides, enabling query formulation and optimization in semistructured database, in:...

X.Q. Gong, W.N. Qian, Y. Yan, A.Y. Zhou, Bloom filter-based XML packets filtering for millions of path queries, in:...

G. Gou, R. Chirkova, Efficient algorithms for evaluating XPath over streams, in: Proceeding of SIGMOD, 2007, pp....

