A distributed selectivity-driven search strategy for semi-structured data over DHT-based networks

https://doi.org/10.1016/j.jpdc.2016.03.015

Highlights

  • A DHT-based framework for indexing and locating XML data over distributed networks.

  • An Adaptive Path Selection (APS) search algorithm that minimizes network traffic.

  • A space-efficient Path Selectivity Table (PST) for path selectivity estimation.

  • A distributed algorithm for PST construction with logarithmic performance bounds.

Abstract

Distributed Hash Tables (DHTs) are widely used for indexing and locating many types of resources, including semi-structured data modeled as XML documents. A common distributed strategy to process an XML query over a DHT consists in splitting it into a set of simple path queries and resolving each of them separately. The traffic generated by this strategy grows with the number of paths in the query. To overcome this drawback, an alternative strategy consists in resolving only the sub-query associated with the most selective path, and then submitting the original query to the nodes in the result set. A first goal of this paper is to provide an analytical and experimental study of the two strategies to assess their relative merits in different scenarios. On the basis of this study, we introduce an Adaptive Path Selection (APS) search technique that resolves an XML query in a distributed way by querying either the most selective path or the whole path set, based on the selectivity of the paths in the query. The effective use of APS requires that the querying nodes know in advance the selectivity of all the paths. Addressing this problem is another goal of the paper, which is achieved through: (i) the definition of a space-efficient data structure, the Path Selectivity Table (PST), which, given any path, returns an estimate of its selectivity; (ii) the definition of an efficient strategy that builds the PST in a distributed way and propagates it to all nodes in the network with logarithmic performance bounds and without redundant messages. Experimental results show that the PST accurately estimates the path selectivity values, and that the traffic generated by the APS algorithm using PST-estimated selectivity values is comparable to that produced by APS when the actual path selectivity values are known.

Introduction

Distributed Hash Tables (DHTs) are decentralized systems providing scalable services for indexing and locating data in large-scale networks. DHT-based systems like Chord [30], Pastry [27], Tapestry [34], and Kademlia [20] assign to each node the responsibility for a specific part of the data to be shared. In a network of n nodes, when a node wants to find a data item identified by a given key, a DHT makes it possible to locate the node responsible for that key in O(log n) hops, using only O(log n) neighbors per node. Thanks to their inherent reliability and autonomic properties, DHTs can be effectively used in dynamic peer-to-peer networks with nodes continuously joining and leaving [30], as well as in static decentralized systems composed of a large number of nodes permanently connected to a wide-area network [31]. In both cases, an important system goal is limiting the network traffic generated by distributed query processing. A key step toward this goal is efficiently locating relevant data sources, so as to submit the queries only to the nodes where those data sources are stored.

Leveraging a DHT, complex queries over a large collection of distributed data can be processed with the guarantee that all the relevant documents are located with logarithmic performance bounds. In a DHT-based system, a query Q can be processed in two phases: (i) the DHT is looked up to identify all nodes that store data matching Q; (ii) Q is submitted to each node identified during the previous phase, to get all the data matching Q. In this work we focus on the first phase of the query processing, with the goal of minimizing the amount of traffic generated to identify the nodes that will be queried during the second phase.

Even though DHTs can be used for indexing many types of data, in this paper we concentrate on XML-based semi-structured data, as XML is widely used as a language for information representation and exchange over the Internet. We assume that data sources are distributed over a large number of nodes (i.e., tens to hundreds of thousands) permanently connected to the network like, for instance, a world-wide network of Internet-enabled sensor stations organized in a DHT. Another example is a large-scale DHT-based network of service providers, where a large number of services, published by different providers in XML format, need to be dynamically discovered and integrated into complex distributed applications. Examples include e-commerce and e-science applications, where the single components are available as services independently specified by their providers  [22].

The indexing of an XML document D in a DHT can be done by associating a key to each path p in D; then, the node responsible for the key associated with p keeps a pointer to the nodes storing all documents containing p, including D. To search for XML documents matching a complex tree pattern query formulated, for instance, as an XPath or XQuery expression, a basic strategy consists in splitting the query into a number of sub-queries, one for each path in the query. Each sub-query is resolved independently to find the set of nodes that store documents matching the corresponding path. The result sets coming from the different sub-queries are intersected at the querying node. Then all the nodes in the intersection set are queried with the original query to obtain all the documents matching that query.
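The indexing scheme and the basic lookup strategy described above can be illustrated with a minimal in-memory sketch. It is not the authors' implementation: the names `build_path_index` and `whole_path_set_lookup`, the node identifiers, and the sample paths are hypothetical, and a plain dictionary stands in for the DHT, where each entry would actually be held by the node responsible for the key of that path.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_path_index(node_paths: Dict[int, Iterable[str]]) -> Dict[str, Set[int]]:
    """Map each path to the set of nodes storing a document that contains it.

    Assumption: a dictionary replaces the DHT; in a real deployment each
    path entry would be stored on the node responsible for the path's key.
    """
    index: Dict[str, Set[int]] = defaultdict(set)
    for node_id, paths in node_paths.items():
        for path in paths:
            index[path].add(node_id)
    return dict(index)


def whole_path_set_lookup(index: Dict[str, Set[int]], query_paths: List[str]) -> Set[int]:
    """Resolve one sub-query per path, then intersect the result sets."""
    result_sets = [index.get(path, set()) for path in query_paths]
    if not result_sets:
        return set()
    return set.intersection(*result_sets)


# Hypothetical data: paths found in the documents stored at three nodes.
node_paths = {
    1: ["/station/location", "/station/sensor/type"],
    2: ["/station/location"],
    3: ["/station/location", "/station/sensor/type", "/station/sensor/range"],
}
index = build_path_index(node_paths)
candidates = whole_path_set_lookup(index, ["/station/location", "/station/sensor/type"])
print(sorted(candidates))
```

Only the nodes in `candidates` would then receive the original query. Note that the number of index lookups in this sketch equals the number of paths in the query.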

The network traffic generated by the strategy above increases with the number of paths in the query. This can lead to system inefficiency in the case of complex queries composed of several sub-queries, particularly in the presence of many concurrent requests. To overcome this drawback, an alternative strategy consists in resolving only the sub-query associated with the most selective path, i.e., the path that matches the lowest number of nodes; then all nodes in the result set can be queried with the original query to get the documents that satisfy all the query constraints (including those associated with the other paths). The selectivity of a path is defined as follows:

Definition 1 Path Selectivity

The selectivity $s_p$ of a path $p$ is given by the equation $s_p = n_p / n$, where $n_p$ is the number of nodes that store at least one instance of path $p$, and $n$ is the total number of nodes in the DHT. By definition, $0 < s_p \le 1$.

The lower the selectivity value, the more selective the path; in other words, the lowest selectivity value corresponds to the most selective path. For instance, in a network with 10,000 nodes, a path stored in 50 nodes has a selectivity of 50/10,000 = 0.005, while a path stored in 5000 nodes has a selectivity of 5000/10,000 = 0.5. The former is an example of a highly selective path (low selectivity value), the latter of a poorly selective path (high selectivity value).
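The definition and the numeric example above can be checked with a short sketch. The function names and the sample paths are hypothetical; the node counts (50 and 5000 out of 10,000) are the ones used in the text.

```python
from typing import Dict


def path_selectivity(nodes_with_path: int, total_nodes: int) -> float:
    """Selectivity of a path: nodes storing the path divided by all nodes."""
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    if not 0 < nodes_with_path <= total_nodes:
        raise ValueError("nodes_with_path must be in (0, total_nodes]")
    return nodes_with_path / total_nodes


def most_selective_path(node_counts: Dict[str, int], total_nodes: int) -> str:
    """Return the path with the lowest selectivity value."""
    return min(node_counts, key=lambda p: path_selectivity(node_counts[p], total_nodes))


# Hypothetical paths; the counts reproduce the example in the text.
node_counts = {"/station/sensor/range": 50, "/station/location": 5000}
for path, count in node_counts.items():
    print(path, path_selectivity(count, 10_000))
print("most selective:", most_selective_path(node_counts, 10_000))
```

In this sketch the most selective path is the one stored on 50 nodes, so a strategy that resolves only that path would contact at most 50 candidate nodes in the second phase.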

A first goal of this paper is to provide an analytical and experimental study of the two strategies to assess their relative merits in different scenarios. On the basis of this study, we introduce an Adaptive Path Selection (APS) search technique that resolves an XML query by querying either the most selective path or the whole path set, based on the selectivity of the paths in the query. APS uses path selectivity values to calculate the traffic that would be generated by querying the most selective path and by querying the whole path set, and then follows the more efficient strategy for the given query. Experimental results confirm that APS saves a significant amount of traffic compared to the two strategies from which it derives.

The effective use of APS requires that all nodes know in advance the selectivity of all the paths. If path selectivity values are not known a priori, techniques for estimating the path selectivity values can be used. We propose a compact data structure, called Path Selectivity Table (PST), which groups paths with similar selectivity values into a fixed number of buckets. Bloom Filters are used to represent the paths in each bucket so that, for a given path, the bucket containing the selectivity of the path can be quickly located. Thus, given any path, the PST returns an estimate of its selectivity.
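The idea of grouping paths into buckets represented by Bloom filters can be sketched as follows. This is an illustration under stated assumptions, not the PST defined in the paper: equal-width buckets over the selectivity range, the bucket midpoint as the returned estimate, the first matching bucket winning when a Bloom filter yields a false positive, and the filter parameters (1024 bits, 3 hash functions) are all choices made here for brevity. The class and method names are hypothetical.

```python
import hashlib
from typing import Iterator, List, Optional


class BloomFilter:
    """Minimal Bloom filter; a Python int is used as the bit array."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several hash positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class PathSelectivityTable:
    """Fixed number of buckets, one Bloom filter per bucket (assumed layout)."""

    def __init__(self, num_buckets: int = 10) -> None:
        self.num_buckets = num_buckets
        self.filters: List[BloomFilter] = [BloomFilter() for _ in range(num_buckets)]

    def add(self, path: str, selectivity: float) -> None:
        if not 0 < selectivity <= 1:
            raise ValueError("selectivity must be in (0, 1]")
        # Assumption: bucket i covers [i/B, (i+1)/B); 1.0 falls in the last one.
        bucket = min(int(selectivity * self.num_buckets), self.num_buckets - 1)
        self.filters[bucket].add(path)

    def estimate(self, path: str) -> Optional[float]:
        """Return the midpoint of the first bucket whose filter matches."""
        for i, bloom in enumerate(self.filters):
            if path in bloom:
                return (i + 0.5) / self.num_buckets
        return None  # path not recorded in any bucket


pst = PathSelectivityTable(num_buckets=10)
pst.add("/station/sensor/range", 0.005)
pst.add("/station/location", 0.5)
print(pst.estimate("/station/sensor/range"))
print(pst.estimate("/station/location"))
```

The table stores no path strings, only one bit array per bucket, which is what makes the structure compact. The price is a coarse estimate (the bucket midpoint here) and a small probability of placing a path in the wrong bucket because of Bloom filter false positives.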

Our solution differs from existing systems since it supports node-based selectivity estimation, which allows every peer to estimate the total number of nodes that store sources relevant to a query. This is a key advantage because the traffic produced by distributed query processing depends on the number of nodes with relevant sources, so the network traffic that any query will generate can be estimated in advance. The node-based selectivity estimation strategy enables another unique feature of our solution, which is adaptive lookup: our APS search strategy allows peers to resolve a query by querying either the most selective path or the whole path set. This makes it possible to achieve good traffic performance while maintaining a basic indexing/search scheme that can be easily implemented on top of any DHT. Another unique feature exhibited by our system is local selectivity estimation: the PST allows every participating peer to estimate the selectivity of a path locally, without querying the network for this purpose.

A preliminary version of this work, which focused on introducing the APS technique, was presented in  [7]. In the present version we make the following additional contributions:

  1. A more formal definition of the APS algorithm and of the two basic strategies from which it derives, together with their analytical and experimental evaluation.

  2. The definition of a space-efficient data structure, the PST, for XML path selectivity estimation in a distributed scenario.

  3. The definition of a PST Construction and Propagation (PST_CP) algorithm that builds the PST in a distributed way and propagates it to all nodes in the network with logarithmic performance bounds.

  4. An experimental evaluation demonstrating the PST accuracy for path selectivity estimation, as well as the efficiency of the PST in supporting APS search in distributed scenarios.

We point out that the APS technique could also be exploited by related non-holistic approaches that, like ours, index XML data over a DHT using paths as indexing elements. APS builds on the two basic strategies for resolving XML queries over a DHT and adaptively chooses between the whole path set and the most selective path. Related approaches could thereby reduce the network traffic they generate, because either strategy can outperform the other depending on the specific scenario. Moreover, both holistic and non-holistic approaches could exploit our PST Construction and Propagation (PST_CP) algorithm to construct a PST-like data structure in a distributed way and to propagate selectivity estimates across the network. This can be particularly relevant for centralized approaches, which would thereby be able to implement a distributed strategy to process XML queries over a DHT.

The remainder of the paper is organized as follows. Section  2 discusses related work. Section  3 presents the system model, including the data model and the way XML documents are indexed. Section  4 describes and compares the two basic approaches exploited by the APS search strategy to answer an XML query. Section  5 proposes the APS algorithm and evaluates it in different scenarios. Section  6 presents the PST-based approach to path selectivity estimation, including a detailed description of the PST_CP algorithm. Section  7 evaluates the accuracy of PSTs for path selectivity estimation and their effectiveness in supporting APS-based query processing. Finally, Section  8 concludes the paper.

Section snippets

Related work

The work that first introduced the whole path set (WPS) and most selective path (MSP) approaches that we rely on is [6], which proposes a Multi-Attribute Addressable Network (MAAN) that extends Chord to support multi-attribute and range queries. MAAN addresses range queries by mapping attribute values to Chord via uniform locality-preserving hashing. It uses an iterative (corresponding to our WPS search) or single-attribute-dominated (corresponding to our MSP search) query routing algorithm to resolve multi-attribute queries. Apart

System model

We assume a system composed of a set N of autonomous nodes, organized in a DHT-based structured P2P network like Chord  [30], and a set D of XML documents distributed over those nodes and indexed using the DHT to support their efficient identification and retrieval.

Example 1

Consider the case of a world-wide network of Internet-enabled weather sensor stations, organized in a DHT-based Chord overlay. Each sensor station locally stores, in XML format, a sensor station descriptor (SSD) and a set of sensor

Query processing

In a DHT-based system, the query processing can be divided into two phases:

  1. the DHT is looked up to identify all nodes that store XML documents matching a query Q;

  2. Q is sent to the nodes identified in the previous phase, which will execute Q locally.

As stated earlier, we focus on the first phase of the query processing, with the goal of minimizing the amount of traffic generated to identify the nodes that will be queried during the second phase.

To search for XML documents indexed in a DHT

Adaptive Path Selection search

The results presented in the previous section highlight that, for any value of m, there are cases in which WPS generates less traffic than MSP, and vice versa. In particular, it can be shown that:

Theorem 3

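The traffic generated by MSP is lower than the traffic generated by WPS when $s_{p_{min}} < Th$, where:

$$Th = \frac{(m-1)\left(S_H + \frac{1}{2}(S_H + S_S)\log_2 n\right) + S_C\, n \sum_{i=1}^{m} s_{p_i} + (2S_H + S_Q)\, n \prod_{i=1}^{m} s_{p_i}}{(S_C + 2S_H + S_Q)\, n}$$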

Proof

The theorem can be easily proven by solving the inequality $\bar{T}_{MSP} < \bar{T}_{WPS}$ or, equivalently, $T_{MSP} < T_{WPS}$.
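How such a threshold can drive the choice between the two strategies is sketched below. The sketch rests on assumptions that should be checked against the full text: it reads the threshold as having a per-lookup term multiplied by (m - 1), a term with the sum of the path selectivities, and a term with their product, all divided by (S_C + 2 S_H + S_Q) n. The size parameters S_H, S_S, S_C and S_Q are treated as opaque inputs whose meaning is not given in this excerpt, and the numeric values below are invented for illustration. The function names are hypothetical.

```python
import math
from typing import Sequence


def msp_threshold(selectivities: Sequence[float], n: int,
                  s_h: float, s_s: float, s_c: float, s_q: float) -> float:
    """Threshold Th under the assumed reading of the formula (sum and product terms)."""
    m = len(selectivities)
    lookup_term = s_h + 0.5 * (s_h + s_s) * math.log2(n)
    numerator = ((m - 1) * lookup_term
                 + s_c * n * sum(selectivities)
                 + (2 * s_h + s_q) * n * math.prod(selectivities))
    return numerator / ((s_c + 2 * s_h + s_q) * n)


def choose_strategy(selectivities: Sequence[float], n: int,
                    s_h: float, s_s: float, s_c: float, s_q: float) -> str:
    """Pick MSP when the lowest selectivity is below the threshold, else WPS."""
    th = msp_threshold(selectivities, n, s_h, s_s, s_c, s_q)
    return "MSP" if min(selectivities) < th else "WPS"


# Invented parameter values, for illustration only.
params = dict(n=10_000, s_h=40, s_s=20, s_c=6, s_q=200)
print(choose_strategy([0.005, 0.5], **params))  # one very selective path
print(choose_strategy([0.5, 0.5], **params))    # two poorly selective paths
```

With these invented values, a query containing one highly selective path is routed to MSP, while a query whose paths are all poorly selective is routed to WPS. For a single-path query the threshold reduces to the selectivity of that path, which reflects the fact that the two strategies coincide in that case.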

We exploit this result

Path selectivity estimation

Like the MSP search technique, APS needs to know the selectivity of the paths in the query. This requires two issues to be addressed: (i) defining a space-efficient data structure to store the selectivity of each path in the network; (ii) devising an effective solution to build and propagate such data structure across the network. We address the first issue by defining a Path Selectivity Table (PST) for XML path selectivity estimation in a distributed scenario, and the second one by defining a

PST performance

This section evaluates the accuracy of PSTs for path selectivity estimation and their effectiveness in supporting APS-based query processing.

Conclusions

We analytically compared the WPS and MSP search strategies to query XML data indexed in a DHT-based system. Based on this study, we defined the Adaptive Path Selection (APS) search strategy that adaptively resolves an XML query by querying either the most selective path or the whole path set. Experimental results confirmed that APS saves a significant amount of traffic compared to WPS and MSP.

Like MSP, APS needs to know the selectivity of the paths in the query. This required two issues to be

Carmela Comito is a researcher at the Institute of High Performance Computing and Networking of the Italian National Research Council (ICAR-CNR), Italy. She received her Ph.D. in systems and computer engineering from the University of Calabria, Italy. Her current research interests include peer-to-peer systems, mobile and ubiquitous computing, data mining, and energy-aware distributed systems.

References (34)

  • G. Pirro et al., A DHT-based semantic overlay network for service discovery, Future Gener. Comput. Syst. (2012)
  • D. Talia et al., Enabling dynamic querying over distributed hash tables, J. Parallel Distrib. Comput. (2010)
  • K. Aberer
  • K. Aberer et al., P-Grid: a self-organizing structured P2P system, SIGMOD Rec. (2003)
  • S. Abiteboul, I. Manolescu, N. Polyzotis, N. Preda, C. Sun, XML processing in DHT networks. 24th IEEE Int. Conf. Data...
  • A. Aboulnaga, A.R. Alameldeen, J.F. Naughton, Estimating the selectivity of XML path expressions for Internet scale...
  • B. Bloom, Space/time tradeoffs in hash coding with allowable errors, Commun. ACM (1970)
  • M. Cai et al., MAAN: A multi-attribute addressable network for grid information services, J. Grid Comput. (2004)
  • C. Comito, D. Talia, P. Trunfio, Selectivity-based XML query processing in structured peer-to-peer networks, in: 14th...
  • S. El-Ansary, L. Alima, P. Brand, S. Haridi, Efficient broadcast in structured P2P networks, in: 2nd Int. Workshop on...
  • Li Fan et al., Summary cache: a scalable wide-area web cache sharing protocol, IEEE/ACM Trans. Netw. (2000)
  • D.K. Fisher, S. Maneth, Structural Selectivity Estimation for XML Documents, in: Proc. of the 23th IEEE Intl....
  • L. Galanis, Y. Wang, S.R. Jeffery, D.J. DeWitt, Locating data sources in large distributed systems, in: 29th Int. Conf....
  • L. Garces-Erice, P.A. Felber, E.W. Biersack, G. Urvoy-Keller, K.W. Ross, Data indexing in peer-to-peer DHT networks,...
  • A. Ghodsi, Multicast and Bulk lookup in structured overlay networks
  • H.V. Jagadish, V. Poosala, N. Koudas, K. Sevcik, S. Muthukrishnan, T. Suel, Optimal histograms with quality guarantees,...
  • K. Karanasos, A. Katsifodimos, I. Manolescu, S. Zoupanos, ViP2P: Efficient XML management in DHT networks, in: 12th...

    Domenico Talia is a professor of computer engineering at the University of Calabria. His research interests include parallel and distributed data mining algorithms, Cloud computing, Grid services, distributed knowledge discovery, peer-to-peer systems, and parallel programming models. He published eleven books and more than 300 papers in archival journals and conference proceedings.

    Paolo Trunfio is an associate professor of computer engineering at the University of Calabria, Italy. He received his Ph.D. in systems and computer engineering from the same university. In 2007 he was a visiting researcher at the Swedish Institute of Computer Science. His research interests include Cloud computing, service-oriented architectures, distributed knowledge discovery, and peer-to-peer systems.
