A distributed selectivity-driven search strategy for semi-structured data over DHT-based networks
Introduction
Distributed Hash Tables (DHTs) are decentralized systems providing scalable services for indexing and locating data in large-scale networks. DHT-based systems such as Chord [30], Pastry [27], Tapestry [34], and Kademlia [20] assign to each node the responsibility for a specific part of the data to be shared. In a network of $N$ nodes, when a node wants to find a data item identified by a given key, a DHT locates the node responsible for that key in $O(\log N)$ hops, using only $O(\log N)$ neighbors per node. Thanks to their inherent reliability and autonomic properties, DHTs can be used effectively in dynamic peer-to-peer networks with nodes continuously joining and leaving [30], as well as in static decentralized systems composed of a large number of nodes permanently connected to a wide-area network [31]. In both cases, an important system goal is limiting the network traffic generated by distributed query processing. A key step toward this goal is efficiently locating relevant data sources, so that queries are submitted only to the nodes where those data sources are stored.
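To make the notion of a node being "responsible" for a key concrete, the following is a minimal sketch in the style of Chord's successor rule: a key belongs to the first node whose identifier follows it on a circular identifier space. The ring size, the node identifiers, and the function names are illustrative assumptions. The sketch also computes responsibility from a global view of the ring, whereas a real DHT reaches the same node by routing through neighbors.

```python
import hashlib
from bisect import bisect_left

RING_BITS = 16  # assumption: a small identifier space, for illustration only


def ring_id(value):
    """Hash a string onto the identifier ring [0, 2**RING_BITS)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** RING_BITS)


def responsible_node(key_id, node_ids):
    """Return the first node identifier at or after key_id, wrapping around."""
    ring = sorted(node_ids)
    idx = bisect_left(ring, key_id)
    return ring[idx % len(ring)]  # past the largest id, wrap to the smallest


# Hypothetical node identifiers already placed on the ring.
nodes = [100, 20000, 41000, 60000]
owner = responsible_node(ring_id("/station/location"), nodes)
```

Here `owner` is always one of the identifiers in `nodes`; which one depends on the hash of the key.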
Leveraging a DHT, complex queries over a large collection of distributed data can be processed with the guarantee that all relevant documents are located within logarithmic performance bounds. In a DHT-based system, a query $Q$ can be processed in two phases: (i) the DHT is looked up to identify all nodes that store data matching $Q$; (ii) $Q$ is submitted to each node identified during the previous phase, to get all the data matching $Q$. In this work we focus on the first phase of query processing, with the goal of minimizing the traffic generated to identify the nodes that will be queried during the second phase.
Even though DHTs can be used for indexing many types of data, in this paper we concentrate on XML-based semi-structured data, as XML is widely used as a language for information representation and exchange over the Internet. We assume that data sources are distributed over a large number of nodes (i.e., tens to hundreds of thousands) permanently connected to the network like, for instance, a world-wide network of Internet-enabled sensor stations organized in a DHT. Another example is a large-scale DHT-based network of service providers, where a large number of services, published by different providers in XML format, need to be dynamically discovered and integrated into complex distributed applications. Examples include e-commerce and e-science applications, where the single components are available as services independently specified by their providers [22].
An XML document $d$ can be indexed in a DHT by associating a key with each path $p$ in $d$; the node responsible for the key associated with $p$ then keeps a pointer to the nodes storing all documents containing $p$, including $d$. To search for XML documents matching a complex tree pattern query formulated, for instance, as an XPath or XQuery expression, a basic strategy consists of splitting the query into a number of sub-queries, one for each path in the query. Each sub-query is resolved independently to find the set of nodes that store documents matching the corresponding path. The result sets coming from the different sub-queries are intersected at the querying node. All the nodes in the intersection set are then queried with the original query to obtain all the documents matching that query.
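The path-based indexing and the basic strategy described above can be sketched as follows. This is only an illustration: the class `PathIndex` is a hypothetical in-memory stand-in for the DHT, which in a real deployment would hash each path to a key and store the pointers on the responsible node. The node names and paths are invented for the example.

```python
from collections import defaultdict


class PathIndex:
    """Toy stand-in for the DHT: maps each path to the nodes storing matching documents."""

    def __init__(self):
        self._index = defaultdict(set)

    def publish(self, node_id, paths):
        """Register that node_id stores a document containing each given path."""
        for path in paths:
            self._index[path].add(node_id)

    def lookup(self, path):
        """Resolve one sub-query: the set of nodes storing documents with this path."""
        return set(self._index.get(path, set()))


def whole_path_set_lookup(index, query_paths):
    """Resolve one sub-query per path and intersect the result sets."""
    result = None
    for path in query_paths:
        nodes = index.lookup(path)
        result = nodes if result is None else result & nodes
    return result if result is not None else set()


index = PathIndex()
index.publish("n1", ["/station/location", "/station/sensor/temperature"])
index.publish("n2", ["/station/location"])
index.publish("n3", ["/station/location", "/station/sensor/temperature"])

candidates = whole_path_set_lookup(
    index, ["/station/location", "/station/sensor/temperature"]
)
```

Only the nodes in `candidates` would then receive the original query. Each additional path in the query adds one more lookup, which is why the traffic of this strategy grows with the number of paths.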
The network traffic generated by the strategy above increases with the number of paths in the query. This can make the system inefficient for complex queries composed of several sub-queries, particularly in the presence of many concurrent requests. To overcome this drawback, an alternative strategy resolves only the sub-query associated with the most selective path, i.e., the path that matches the lowest number of nodes; all nodes in the result set are then queried with the original query to get the documents that satisfy all the query constraints (including those associated with the other paths). The selectivity of a path is defined as follows:
Definition 1 (Path Selectivity). The selectivity of a path $p$ is given by the equation $sel(p) = N_p / N$, where $N_p$ is the number of nodes that store at least one instance of path $p$, and $N$ is the total number of nodes in the DHT. By definition, $0 \le sel(p) \le 1$.
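As a small worked illustration of Definition 1 and of the choice of the most selective path, the sketch below assumes that selectivity is the fraction of nodes storing at least one instance of a path, so that the most selective path is the one stored on the fewest nodes. The function names and the node counts are hypothetical.

```python
def selectivity(nodes_with_path, total_nodes):
    """Fraction of nodes storing at least one instance of the path (assumed form)."""
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    return nodes_with_path / total_nodes


def most_selective_path(node_counts):
    """Return the path that matches the lowest number of nodes."""
    return min(node_counts, key=node_counts.get)


# Hypothetical per-path node counts in a DHT of 1000 nodes.
counts = {
    "/station/location": 500,
    "/station/sensor/temperature": 200,
    "/station/sensor/ozone": 50,
}
best = most_selective_path(counts)
best_sel = selectivity(counts[best], 1000)
```

With these counts, only the 50 nodes matching `/station/sensor/ozone` would receive the original query, at the cost of a single lookup.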
A first goal of this paper is to provide an analytical and experimental study of the two strategies to assess their relative merits in different scenarios. On the basis of this study, we introduce an Adaptive Path Selection (APS) search technique that resolves an XML query by querying either the most selective path or the whole path set, based on the selectivity of the paths in the query. APS uses path selectivity values to calculate the traffic that would be generated querying the most selective path and the whole path set, which allows selecting the most efficient strategy to follow for a given query. Experimental results confirm that APS saves a significant amount of traffic compared to the two strategies from which it derives.
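The adaptive choice can be sketched as a comparison between two traffic estimates. The cost model below is a deliberately simple assumption and not the model analyzed in the paper: each lookup is charged about $\log_2 N$ hops plus one reply, each queried node is charged a request and a reply, and the size of the intersection set is estimated by treating the paths as independent. All names are hypothetical.

```python
import math


def estimate_traffic(selectivities, total_nodes):
    """Return (msp_cost, wps_cost) in messages under the assumed cost model."""
    if not selectivities or total_nodes <= 0:
        raise ValueError("need at least one path and a positive node count")
    lookup_cost = math.ceil(math.log2(total_nodes)) + 1  # routing hops + reply
    query_cost = 2                                       # request + reply per node
    sel_min = min(selectivities)
    joint = math.prod(selectivities)  # assumption: paths are independent
    msp = lookup_cost + query_cost * total_nodes * sel_min
    wps = len(selectivities) * lookup_cost + query_cost * total_nodes * joint
    return msp, wps


def aps_choose(selectivities, total_nodes):
    """Pick the strategy with the lower estimated traffic."""
    msp, wps = estimate_traffic(selectivities, total_nodes)
    return "MSP" if msp <= wps else "WPS"


# One very selective path: a single lookup is enough.
choice_a = aps_choose([0.001, 0.5, 0.5], 1024)
# No selective path: intersecting all result sets pays off.
choice_b = aps_choose([0.5, 0.5, 0.5], 1024)
```

Under this model the first query is resolved with the most selective path and the second with the whole path set, which mirrors the observation that neither basic strategy wins in every scenario.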
The effective use of APS requires that all nodes know in advance the selectivity of all the paths. If path selectivity values are not known a priori, techniques for estimating the path selectivity values can be used. We propose a compact data structure, called Path Selectivity Table (PST), which groups paths with similar selectivity values into a fixed number of buckets. Bloom Filters are used to represent the paths in each bucket so that, for a given path, the bucket containing the selectivity of the path can be quickly located. Thus, given any path, the PST returns an estimate of its selectivity.
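A minimal sketch of such a table is given below. The text above does not specify how bucket boundaries are chosen, how filters are sized, or in which order buckets are probed, so all of these are assumptions: each bucket has a single representative selectivity value, a path is placed in the bucket with the closest value, and buckets are probed in ascending order. The Bloom filter is a plain textbook version built on `hashlib`.

```python
import hashlib


class BloomFilter:
    """Textbook Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class PathSelectivityTable:
    """Fixed set of buckets, each with a representative selectivity and a filter."""

    def __init__(self, bucket_values):
        self.buckets = [(v, BloomFilter()) for v in sorted(bucket_values)]

    def insert(self, path, selectivity):
        # Assumption: use the bucket whose representative value is closest.
        _, bloom = min(self.buckets, key=lambda b: abs(b[0] - selectivity))
        bloom.add(path)

    def estimate(self, path, default=None):
        # Probe buckets in ascending order; return the first apparent match.
        for value, bloom in self.buckets:
            if path in bloom:
                return value
        return default


pst = PathSelectivityTable([0.01, 0.1, 0.5])
pst.insert("/station/sensor/ozone", 0.012)
pst.insert("/station/location", 0.45)
ozone_estimate = pst.estimate("/station/sensor/ozone")
```

Because Bloom filters admit false positives but no false negatives, an inserted path is always found, although it may occasionally be reported by an earlier bucket than its own. The estimate is therefore approximate by construction, and the table stays compact because paths themselves are never stored.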
Our solution differs from existing systems in that it supports node-based selectivity estimation, which allows every peer to estimate the total number of nodes that store sources relevant to a query. This is a key advantage: since the traffic produced by distributed query processing depends on the number of nodes with relevant sources, it allows the network traffic generated by any query to be estimated in advance. Node-based selectivity estimation enables another unique feature of our solution, adaptive lookup: our APS search strategy allows peers to resolve a query by querying either the most selective path or the whole path set. This achieves good traffic performance while maintaining a basic indexing/search scheme that can be easily implemented on top of any DHT. A further unique feature of our system is local selectivity estimation: the PST allows every participating peer to estimate the selectivity of a path locally, without querying the network for this purpose.
A preliminary version of this work, which focused on introducing the APS technique, was presented in [7]. In the present version we make the following additional contributions:
1. A more formal definition of the APS algorithm and of the two basic strategies from which it derives, together with their analytical and experimental evaluation.
2. The definition of a space-efficient data structure, the PST, for XML path selectivity estimation in a distributed scenario.
3. The definition of a PST Construction and Propagation (PST_CP) algorithm that builds the PST in a distributed way and propagates it to all nodes in the network with logarithmic performance bounds.
4. An experimental evaluation demonstrating the PST accuracy for path selectivity estimation, as well as the efficiency of the PST in supporting APS search in distributed scenarios.
We point out that the APS technique could be exploited by related non-holistic approaches that, like ours, index XML data over a DHT using paths as indexing elements. APS builds on the two basic strategies for resolving XML queries over a DHT and adaptively uses either the whole path set or the most selective path. Since either strategy can outperform the other depending on the scenario, such approaches could reduce the network traffic they generate. Moreover, both holistic and non-holistic approaches could exploit our PST Construction and Propagation (PST_CP) algorithm to build a PST-like data structure in a distributed way and propagate selectivity estimates across the network. This can be particularly relevant for centralized approaches, which could in this way implement a distributed strategy to process XML queries over a DHT.
The remainder of the paper is organized as follows. Section 2 discusses related work. Section 3 presents the system model, including the data model and the way XML documents are indexed. Section 4 describes and compares the two basic approaches exploited by the APS search strategy to answer an XML query. Section 5 proposes the APS algorithm and evaluates it in different scenarios. Section 6 presents the PST-based approach to path selectivity estimation, including a detailed description of the PST_CP algorithm. Section 7 evaluates the accuracy of PSTs for path selectivity estimation and their effectiveness in supporting APS-based query processing. Finally, Section 8 concludes the paper.
Related work
The work that first introduced the Whole Path Set (WPS) and Most Selective Path (MSP) approaches that we rely on is [6], which proposes a Multi-Attribute Addressable Network (MAAN) that extends Chord to support multi-attribute and range queries. MAAN addresses range queries by mapping attribute values to Chord via uniform locality-preserving hashing. To resolve multi-attribute queries, it uses either an iterative query routing algorithm (corresponding to our WPS search) or a single-attribute-dominated one (corresponding to our MSP search). Apart
System model
We assume a system composed of a set of autonomous nodes, organized in a DHT-based structured P2P network like Chord [30], and a set of XML documents distributed over those nodes and indexed using the DHT to support their efficient identification and retrieval. Example 1 Consider the case of a world-wide network of Internet-enabled weather sensor stations, organized in a DHT-based Chord overlay. Each sensor station locally stores, in XML format, a sensor station descriptor (SSD) and a set of sensor
Query processing
In a DHT-based system, the query processing can be divided into two phases:
1. the DHT is looked up to identify all nodes that store XML documents matching a query $Q$;
2. $Q$ is sent to the nodes identified in the previous phase, which will execute it locally.
To search for XML documents indexed in a DHT
Adaptive Path Selection search
The results presented in the previous section highlight that, for any value of , there are cases in which WPS generates less traffic than MSP, and vice versa. In particular, it can be shown that: Theorem 3 The traffic generated by MSP is lower than the traffic generated by WPS when , where:
Proof The theorem can be easily proven by solving the inequality or, equivalently, . ■
We exploit this result
Path selectivity estimation
Like the MSP search technique, APS needs to know the selectivity of the paths in the query. This requires two issues to be addressed: (i) defining a space-efficient data structure to store the selectivity of each path in the network; (ii) devising an effective solution to build and propagate such data structure across the network. We address the first issue by defining a Path Selectivity Table (PST) for XML path selectivity estimation in a distributed scenario, and the second one by defining a
PST performance
This section evaluates the accuracy of PSTs for path selectivity estimation and their effectiveness in supporting APS-based query processing.
Conclusions
We analytically compared the WPS and MSP search strategies to query XML data indexed in a DHT-based system. Based on this study, we defined the Adaptive Path Selection (APS) search strategy that adaptively resolves an XML query by querying either the most selective path or the whole path set. Experimental results confirmed that APS saves a significant amount of traffic compared to WPS and MSP.
Like MSP, APS needs to know the selectivity of the paths in the query. This required two issues to be
References (34)
- et al., A DHT-based semantic overlay network for service discovery, Future Gener. Comput. Syst. (2012)
- et al., Enabling dynamic querying over distributed hash tables, J. Parallel Distrib. Comput. (2010)
- et al., P-Grid: a self-organizing structured P2P system, SIGMOD Rec. (2003)
- S. Abiteboul, I. Manolescu, N. Polyzotis, N. Preda, C. Sun, XML processing in DHT networks. 24th IEEE Int. Conf. Data...
- A. Aboulnaga, A.R. Alameldeen, J.F. Naughton, Estimating the selectivity of XML path expressions for Internet scale...
- Space/time tradeoffs in hash coding with allowable errors, Commun. ACM (1970)
- et al., MAAN: A multi-attribute addressable network for grid information services, J. Grid Comput. (2004)
- C. Comito, D. Talia, P. Trunfio, Selectivity-based XML query processing in structured peer-to-peer networks, in: 14th...
- S. El-Ansary, L. Alima, P. Brand, S. Haridi, Efficient broadcast in structured P2P networks, in: 2nd Int. Workshop on...
- Summary cache: a scalable wide-area web cache sharing protocol, IEEE/ACM Trans. Netw.
- Multicast and Bulk lookup in structured overlay networks
Carmela Comito is a researcher at the Institute of High Performance Computing and Networking of the Italian National Research Council (ICAR-CNR), Italy. She received her Ph.D. in systems and computer engineering from the University of Calabria, Italy. Her current research interests include peer-to-peer systems, mobile and ubiquitous computing, data mining, and energy-aware distributed systems.
Domenico Talia is a professor of computer engineering at the University of Calabria. His research interests include parallel and distributed data mining algorithms, Cloud computing, Grid services, distributed knowledge discovery, peer-to-peer systems, and parallel programming models. He published eleven books and more than 300 papers in archival journals and conference proceedings.
Paolo Trunfio is an associate professor of computer engineering at the University of Calabria, Italy. He received his Ph.D. in systems and computer engineering from the same university. In 2007 he was a visiting researcher at the Swedish Institute of Computer Science. His research interests include Cloud computing, service-oriented architectures, distributed knowledge discovery, and peer-to-peer systems.