Information Sciences

Volume 181, Issue 2, 15 January 2011, Pages 284-307
An efficient mechanism for processing similarity search queries in sensor networks

https://doi.org/10.1016/j.ins.2010.08.031

Abstract

The similarity search problem has received considerable attention in the database research community. In sensor network applications, this problem is even more important because of the imprecision of sensor hardware and the variation of environmental parameters. Traditional similarity search mechanisms are both unsuitable and inefficient for these highly energy-constrained sensors. One difficulty is that it is hard to predict which sensor holds the most similar (or closest) data item, so many or even all sensors need to send their data to the query node for further comparison. In this paper, we propose a similarity search algorithm (SSA), a novel framework based on the concept of the Hilbert curve over a data-centric storage structure, for efficiently processing similarity search queries in sensor networks. SSA avoids the need to collect data from all sensors in the network when searching for the most similar data item. The performance study reveals that this mechanism is highly efficient and significantly outperforms previous approaches in processing similarity search queries.

Introduction

Past research on query processing in sensor networks mainly focused on retrieving exact answers from the network. However, the data detected by sensors may be imprecise, either because of lag in database updates [4], [13], [17], [25], [28] or because of noisy readings [2], [5], [8], [30]. In the former case, the sheer volume of readings and the limited energy and wireless bandwidth may not allow for continuous and instantaneous updates. Therefore, the database state may not reflect the true state of the real world. It is often infeasible for the database to contain the exact status of a monitored entity at every moment in time. Typically, the data of an entity is known with certainty only at the time of the update. The latter case, however, is due to inaccuracies of measurement. The sources of inaccuracy include, but are not limited to: (a) noise from external sources, (b) inaccuracies in the measurement technique, and (c) imprecision in computing a derived value from the underlying measurements.

The reasons for imprecision are not all hardware-related. An application itself may also require similar (or close) data in addition to the exact answer. For example, sensors are used to detect mudflows and landslides in mountain villages that are in danger. Researchers need to gather the observed data from the sensors and warn the villagers before heavy rains or typhoons so that they are alert to possible disasters. Assume that the geographic condition for triggering mudflows and landslides, such as soil water content (swc), is equal to x. A query of “swc being equal to x” is usually meaningless when monitoring mudflow and landslide environments, because once the swc is close to x the hazard can happen at any moment. Hence, what researchers actually need is to find the locations where the monitored swc is close to x. This allows them to warn and evacuate villagers to avoid deaths and injuries.

Another application of similarity search queries in a sensor network is an outdoor biologist who analyzes bird habitats by collecting bird calls with acoustic sensors. The biologist needs to recognize bird species based on the collected data. Because a bird may produce sound over a wide frequency band and the bands of different species may overlap, it is very hard to retrieve the birds of one species without also accessing other species with similar sounds. Hence, similarity search is an almost unavoidable type of query in these applications.

Under such circumstances, exact retrieval is no longer a mandatory requirement. Similar or nearest data is equally important in answering a user query. Traditional similarity search algorithms [6], [7], [9], [12], [15], [18], [29] are either centralized or assume that each node of the system is very powerful (as in the World Wide Web or a peer-to-peer environment), which makes them unsuitable for sensor networks because of the limited bandwidth and power of sensors. Current query processing algorithms for sensor networks, however, are inefficient in processing similarity search queries. These algorithms mainly focus on two types of queries, point queries and range queries [10], [21], [26]. A point query retrieves results from sensors that hold a value exactly matching the given value of the query. A range query retrieves results from sensors whose values fall within the given range of the query. When a point query is executed in the sensor network, the sensors return only those data that exactly match the given query. Using a point query processing technique for a similarity search requires the user to issue multiple point queries with similar conditions in order to retrieve similar data. However, processing multiple queries in this way rapidly drains the sensors' energy. Using a range query processing technique for a similarity search, on the other hand, faces two major problems. First, redundant results might be transmitted to the query node. For example, relative humidity is an important factor influencing the growth of orchids. An orchid grower is looking for a place where the relative humidity is closest to 75% if there is no place where the humidity is exactly 75%. The grower issues a range query such as finding the humidity between 65% and 85%. If there are five sensors whose detected humidity values, 65%, 67%, 74%, 80%, and 85%, fall within the range of the given query, all five sensors will reply to the query node. However, only the humidity 74% is the closest result to the given query, and it is the only one that should be sent back. With this range query processing method, four extra tuples are transmitted to the query node, which wastes the sensors' energy. The second problem is that it might not be easy for a user to specify an appropriate range in a similarity search query. If the given range is too small, there may not be any result that satisfies the condition. If the given range is too wide, too many qualifying results may be returned, which again wastes sensor energy. Therefore, past point query and range query processing techniques are unsuitable for processing similarity search queries in sensor networks.
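The humidity example can be sketched in a few lines of Python. This is only an illustration of the redundancy argument, not the paper's algorithm: the function names `range_query` and `similarity_query` are hypothetical, and message cost is simplified to one tuple per replying sensor.

```python
def range_query(readings, low, high):
    """Return every reading inside [low, high], as a range query would."""
    return [r for r in readings if low <= r <= high]


def similarity_query(readings, target):
    """Return the single reading closest to the target value."""
    return min(readings, key=lambda r: abs(r - target))


# Humidity readings (in percent) from the five sensors in the example.
readings = [65, 67, 74, 80, 85]

in_range = range_query(readings, 65, 85)   # all five sensors reply
closest = similarity_query(readings, 75)   # only 74 is actually needed
wasted = len(in_range) - 1                 # tuples sent beyond the single answer

print(in_range, closest, wasted)
```

Under these assumptions the range query returns all five readings while the similarity query needs only 74, so four transmitted tuples are redundant, which matches the count given in the text.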

A major challenge in processing a similarity search query in a sensor network is that each sensor is only a mini-repository of an entire distributed sensor database. Each sensor knows only its local data and has no global knowledge of the entire sensor database. Hence, while processing a similarity search query, a sensor does not know whether its locally most similar data is also the globally most similar data, and it has to transmit that data somewhere (e.g., to the query node) for further verification. This causes a serious waste of sensor energy on data transmission and data forwarding.

In this paper, we propose a similarity search algorithm (SSA) to overcome the above problems in processing similarity search queries. We choose a group of sensors, named the indexing nodes, to store data based on the data-centric storage (DCS) concept [26]. DCS uses in-network placement of data to increase the efficiency of data retrieval in certain circumstances. The placement of a detected data item is determined by the event type of the data. The event type refers to certain pre-defined constellations of event values such as temperature and pressure. A detected data item with a particular event is stored at an indexing node. The indexing node is determined by looking up a geographic hash table (GHT) [26] using the event of the data. The indexing nodes are chosen from the entire sensor network so that they form a Hilbert curve [11], [31] in the network. Adjacent indexing nodes along the Hilbert curve hold data of similar values. Hence, searching for similar data in this arrangement becomes very easy. In this paper, we discuss how this scheme is realized in a sensor network environment and how deep (i.e., how many levels) the Hilbert curve should be. Our performance study indicates that the proposed method has a significantly lower query processing cost than a previous method when processing a similarity search query. Another attractive feature is that the method is scalable with respect to the number of queries and the number of detected data items.
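The property that neighbouring positions on a Hilbert curve hold similar values can be illustrated with a small Python sketch. It uses the standard index-to-coordinate conversion for a Hilbert curve; the excerpt does not show the paper's actual mapping, so the helper `value_to_index`, the value range 0 to 100, and the choice of a level-2 curve are all assumptions made for illustration, and the step that maps a grid cell to a physical indexing node (via GHT) is omitted.

```python
def hilbert_d2xy(order, d):
    """Convert position d along a Hilbert curve of the given order
    into (x, y) cell coordinates on a 2**order by 2**order grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate or flip the quadrant so sub-curves join end to end.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def value_to_index(value, lo, hi, order):
    """Hypothetical helper: bucket a 1-D value in [lo, hi] into one of
    4**order consecutive positions along the curve."""
    cells = 4 ** order
    idx = int((value - lo) / (hi - lo) * cells)
    return min(max(idx, 0), cells - 1)


# Assumed setup: humidity in [0, 100] percent, level-2 curve (16 cells).
order = 2
i74 = value_to_index(74, 0, 100, order)
i75 = value_to_index(75, 0, 100, order)
cell74 = hilbert_d2xy(order, i74)
cell75 = hilbert_d2xy(order, i75)
print(i74, cell74, i75, cell75)
```

With these assumed parameters, readings of 74% and 75% fall into consecutive curve positions whose grid cells are neighbours, so a query for 75% that finds its own cell empty would only have to look at nearby cells rather than poll every sensor.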

The main contributions of this paper are as follows.

  1. This work is the first to provide an algorithm for searching similar data in wireless sensor network environments.

  2. The data mapping is based on the concept of the Hilbert curve, which is simple and easy to implement. Each sensor can determine, in a distributed manner, the indexing node to which a detected data item should be mapped, which avoids centralized dispatching of data to indexing nodes.

  3. The whole processing is in-network. Only a few indexing nodes are involved in processing a similarity search query, which avoids the need to transmit local similar data from all sensors and therefore dramatically simplifies the task and reduces the energy consumption of sensors.

As a preliminary study of the problem, this paper mainly focuses on processing similarity search queries for one-dimensional data, which means that a query is specified for only one type of event. We leave the multi-dimensional case as future work. The rest of this paper is organized as follows. Section 2 surveys and discusses a representative previous work on data-centric storage that could be applied to processing similarity search queries. Section 3 presents the proposed similarity search algorithm, and Section 4 gives two extensions of it. Section 5 presents the simulation results. Finally, we give our conclusions and future work in Section 6.

Related work

To the best of our knowledge, data-centric storage in Sensornets with a geographic hash table (GHT) [26] may be the most representative approach among the past DCS-based research that is applicable in processing similarity search in a distributed sensor network. However, there exist some obstacles for such schemes to process a similarity search query. We illustrate them in the following.

Essentially, GHT hashes the type of an event into geographic coordinates and stores the detected data of this

Design of a data-centric storage system supporting similarity search

In this section, we first illustrate how to map a Hilbert curve to a sensor network in Section 3.1 and analyze the complexity of building a Hilbert curve in Section 3.2. In Section 3.3, we explain how to select an indexing node and then we propose a data insertion mechanism for storing data in an indexing node. A search mechanism is proposed in Section 3.4 which efficiently finds the answer for a given query. In Section 3.5, we analyze the complexity of the proposed mechanism. The workload

Extensions of our similarity search algorithm

In this section, we propose two extensions of our similarity search algorithm. One supports the processing of range queries, and the other supports the processing of multiple queries.

Simulation results

In this section, we verify the effectiveness of the proposed similarity search algorithm (SSA) by comparing it against SR-GHT in processing similarity search queries. Since communication cost is the main part of the energy consumption of sensors, we use the number of exchanged messages as the comparison metric. The variables include network size, node density, node distribution, and the number of levels of the Hilbert curve. We also compare the performance of the SSA with the SR-GHT

Conclusion

In this paper, we proposed the design and implementation of an algorithm for processing similarity search queries in sensor networks. Our design applies the concept of Hilbert curve to sensor networks such that semantically related data are mapped to adjacent indexing nodes. A similarity search algorithm was proposed for efficiently processing similarity search queries. Such a query can be directly routed to an indexing node to find the matching result or the one that is closest to the given

References (33)

  • Zhao Feng et al., Wireless Sensor Network: An Information Processing Approach (2004)
  • M. Flickner et al., Query by image and video content: the qbic system, IEEE Computer (1995)
  • Benjamin Greenstein, Deborah Estrin, Ramesh Govindan, Sylvia Ratnasamy, Scott Shenker, Difs: a distributed index for...
  • J.G. Griffiths, An algorithm for displaying a class of space-filling curves, Software-Practice and Experience (1986)
  • T. Hastie, R. Tibshirani, Discriminant adaptive nearest neighbor classification, in: Proceedings of the First...
  • T. He, C. Huang, B.M. Blum, J.A. Stankovic, T.F. Abdelzaher, Range-free localization schemes in large scale sensor...