Similarity search in sensor networks using semantic-based caching

doi:10.1016/j.jnca.2011.05.008

Journal of Network and Computer Applications

Volume 35, Issue 2, March 2012, Pages 577-583

https://doi.org/10.1016/j.jnca.2011.05.008

Abstract

Sensor networks build temporary wireless connections in environments where stationary infrastructure is either destroyed or too expensive to construct. Most previous research in sensor networks focuses on routing protocols that adapt to dynamic network topologies, and little work has been done on data access. One important data access application is similarity search, which provides the foundation of content-based retrieval. Many traditional similarity search algorithms are based on centralized or flooding mechanisms, which are not effective in wireless sensor network environments because of constraints such as limited bandwidth and power. In this paper we tackle the problem of similarity search by using semantic-based caching to reflect the data content distribution in the network. The basic idea is to analyze the cached results of earlier queries and to resolve later queries within a small collection of content-related mobile nodes. Based on a Hilbert space-filling curve, the data points in a multi-dimensional semantic space are mapped to a linear representation. These data points are further cached to facilitate query processing. Through extensive simulations, we show that our method performs similarity search with lower search cost and shorter response time.

Highlights

► Similarity search in sensor networks is a challenging task due to the dynamic network topology, limited system resources, and infrastructure-free nature.
► We tackle the problem of similarity search by using semantic-based caching to reflect the data content distribution.
► Based on a Hilbert space-filling curve, the data points in a multi-dimensional semantic space are described as a linear representation.
► These data points are further cached to facilitate query processing.

Introduction

Information retrieval research faces new challenges raised by emerging large-scale distributed networking systems, such as peer-to-peer networks, sensor networks, and mesh networks. Among these systems, the sensor network is the most sophisticated and general form because of its dynamic network topology, limited system resources, and infrastructure-free nature. Conceptually, a sensor network is a collection of cooperative mobile nodes that communicate with each other without the intervention of centralized access points. In a sensor network, conventional information retrieval schemes (Beckmann et al., 1990, Gionis et al., 2001, Ren and Dunham, 2000), based on either centralized or flooding search mechanisms, cannot guarantee satisfactory performance because of restrictions on bandwidth, power, and connectivity. Recently proposed improvements have achieved better performance in several experimental systems (Tang et al., 2003, Hara, 2001); however, the existing search schemes seem to ignore the data content distribution among the mobile nodes. The search process is either blind, irrelevant to the queries, or based on static features of data sources. As a consequence, the search process is either inefficient or incurs high system overhead.

Similarity search, a fundamental topic in information retrieval, has attracted considerable research attention in the past decade (Beyer et al., 1994). The problem can be generally described as finding the top-k most relevant data objects for a given query according to predefined search criteria and semantic distance metrics. Performing similarity search in sensor networks is a challenging task because of the lack of infrastructure. Consider a commonly used form of similarity search: content-based multimedia data retrieval. When a query is issued, the mobility and infrastructure-free nature of the network mean that the contents of the data sources are unknown at the requesting node, and therefore flooding has to be employed to resolve the query. However, flooding consumes large amounts of system resources: storage, bandwidth, and energy. In addition, because of the shared-medium nature of most sensor networks, flooding may also cause serious redundancy, contention, and collision of rebroadcast messages in the shared wireless channels, which is known as the Broadcast Storm (Ni et al., 1999). Moreover, given the sheer size of multimedia data, the performance deterioration is even more drastic.

Caching frequently accessed data objects is an efficient technique for improving system performance in mobile environments. Average data access latency can be reduced because similarity queries may be issued on data objects already cached in local storage, thereby avoiding the need for repeated blind search in the network. Moreover, analysis of earlier queries and their results may help derive knowledge of the content distribution in the sensor network, which provides heuristic navigation for the similarity search of later queries. Based on this content distribution knowledge, the similarity search can be performed as follows: for a query submitted to the sensor network, the mobile nodes in the network are divided into two categories, the nodes containing data related to the query (relevant nodes) and the nodes that do not contain related data (irrelevant nodes). To improve performance, one should avoid query processing by the irrelevant nodes and forward the query only to the relevant nodes. By doing this, the wireless network traffic is reduced, and the system performance is improved as well.
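The split into relevant and irrelevant nodes can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each node's content is summarized as a list of closed one-dimensional key ranges and that a query is also expressed as such a range; the names `overlaps`, `relevant_nodes`, and `content_summary` are hypothetical and are not taken from the paper.

```python
def overlaps(a, b):
    """Return True if the closed integer intervals a=(lo, hi) and b=(lo, hi) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def relevant_nodes(query_range, content_summary):
    """Return the ids of nodes whose summarized content may intersect the query range.

    content_summary is an assumed structure: node id -> list of (lo, hi) key ranges
    learned from earlier query results.
    """
    return sorted(
        node_id
        for node_id, ranges in content_summary.items()
        if any(overlaps(query_range, r) for r in ranges)
    )


# Hypothetical summaries for three nodes.
summary = {"n1": [(0, 15)], "n2": [(40, 63)], "n3": [(10, 20), (50, 55)]}
targets = relevant_nodes((12, 18), summary)  # query is forwarded only to these nodes
```

In this toy setting the query range (12, 18) is forwarded to `n1` and `n3`, while `n2` is skipped, which is the traffic reduction the paragraph above describes.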

This paper presents a solution for efficient similarity search in sensor networks using adaptive semantic-based caching. We define a Hilbert-curve-based representation of the data contents of mobile nodes, and propose a novel caching scheme, Ad-hoc Semantic Caching (ASC), to facilitate similarity search in sensor networks. Simulation results show that the proposed scheme can significantly reduce the search cost in terms of query delay and message complexity.

The remainder of this paper is divided into five sections. Section 2 reviews the related work on similarity search and caching schemes in sensor networks. Section 3 gives the concepts of the Hilbert-curve-based representation, providing the theoretical foundation for caching. Section 4 explains the semantic-based caching scheme. Section 5 evaluates the proposed caching scheme using experimental analysis. Section 6 concludes the paper.

Section snippets

Similarity search

For years, similarity search on spatial data has attracted considerable research interest, especially in the multi-dimensional spaces, e.g. multimedia feature space. The queries of semantically similar data objects are performed by conducting nearest-neighbor retrieval in the multi-dimensional semantic spaces. For presentation simplicity, the data objects, as well as the queries, are considered as data points in the semantic spaces. The result of a query is a collection of data objects closest
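As a minimal sketch of nearest-neighbor retrieval in a semantic space, the following Python example treats the query and the data objects as points and returns the k closest ones. Euclidean distance is an assumption made for illustration (the snippet above does not name the metric), and `top_k_similar` is a hypothetical helper name.

```python
import heapq
import math


def top_k_similar(query, objects, k):
    """Return the k data points closest to `query` (Euclidean distance assumed)."""
    return heapq.nsmallest(k, objects, key=lambda obj: math.dist(query, obj))


# Toy 2-dimensional semantic space.
points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0)]
result = top_k_similar((0.0, 0.0), points, 2)
```

This is a centralized brute-force scan; it only illustrates what a query result is, not how the query is resolved across mobile nodes.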

Preliminaries

Given a set of data objects $X=\{x_1, x_2, \dots, x_m\}$, each data object $x_i$ is represented as an n-dimensional semantic vector $\varphi_{x_i} = (\omega_{x_i}^{1}, \omega_{x_i}^{2}, \dots, \omega_{x_i}^{n})$. The attributes in the semantic vector can be features selected from various application scenarios. From a mathematical viewpoint, these data objects can also be considered as data points in the n-dimensional semantic space. Therefore, the representation of data objects can be performed as describing the content distribution of data points in the semantic
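To make the idea of a linear representation concrete, the following Python sketch maps a point to its position along a Hilbert curve. It is restricted to two dimensions and uses the widely known iterative cell-to-index conversion, not necessarily the construction used in the paper, which works in n dimensions. The grid resolution `order`, the normalization of coordinates to [0, 1], and the names `hilbert_index` and `semantic_key` are assumptions for illustration.

```python
def hilbert_index(order, x, y):
    """Map cell (x, y) of a 2^order x 2^order grid to its position on the Hilbert curve."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def semantic_key(vector, order):
    """Quantize a 2-D vector with coordinates in [0, 1] and return its Hilbert key."""
    n = 1 << order
    cx, cy = (min(int(v * n), n - 1) for v in vector)
    return hilbert_index(order, cx, cy)


key = semantic_key((0.30, 0.72), order=4)  # a single integer in [0, 255]
```

Cells that are consecutive on the curve are always adjacent in the grid, which is why contiguous key ranges can stand in for regions of the semantic space; the converse does not always hold, so nearby points can occasionally receive distant keys.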

Semantic-based caching

In this section, we introduce the semantic-based caching scheme, Adaptive Semantic Caching (ASC), tailored for similarity search. By providing content distribution knowledge based on the Hilbert curve, mobile similarity queries can be answered efficiently using the cached query results. In addition, the replacement of cache content does not cause heavy maintenance overhead.
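The following Python sketch shows one way cached query results could be reused. It is not the ASC scheme itself, whose matching and replacement rules are not given in this excerpt. It assumes that each cached entry is keyed by a closed range of linear (Hilbert-style) keys, that a new query is answered locally when its range lies inside a cached range, and that replacement is least-recently-used; the class `SemanticCache` and its methods are hypothetical.

```python
from collections import OrderedDict


class SemanticCache:
    """Toy cache of query results keyed by closed key ranges (LRU replacement assumed)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # (lo, hi) -> list of (key, object_id)

    def put(self, key_range, results):
        """Store the results of a resolved query and evict the oldest entries if full."""
        self._entries[key_range] = list(results)
        self._entries.move_to_end(key_range)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop the least recently used entry

    def lookup(self, key_range):
        """Return cached results if some cached range contains key_range, else None."""
        lo, hi = key_range
        hit = next((r for r in self._entries if r[0] <= lo and hi <= r[1]), None)
        if hit is None:
            return None  # cache miss: the query must be resolved in the network
        self._entries.move_to_end(hit)  # mark the entry as recently used
        return [(k, obj) for k, obj in self._entries[hit] if lo <= k <= hi]


cache = SemanticCache(capacity=2)
cache.put((0, 15), [(3, "a"), (9, "b"), (14, "c")])
cache.put((40, 63), [(41, "d")])
hit = cache.lookup((8, 15))    # contained in (0, 15), answered locally
miss = cache.lookup((10, 30))  # not contained in any cached range
```

Containment is a conservative reuse rule: a partially overlapping entry is treated as a miss here, whereas a fuller scheme might answer the overlapping part locally and forward only the remainder.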

Performance evaluation

Extensive simulations are carried out to evaluate the performance of the ASC scheme and compare it with two recently proposed schemes—Semantic Caching (SMC) (Ren and Dunham, 2000) and Hybrid Ad Hoc Caching (HAC) (Yin and Cao, 2004).

Conclusions

We proposed a dynamic semantic-based caching scheme that facilitates similarity search in sensor networks. This scheme is based on semantic analysis of cached query results to represent data content distribution in the network. It has several innovative characteristics such as Hilbert curve representation, semantic locality exploitation, and non-flooding query resolution.

The analysis and experimental evaluation give us an overview of the performance of the proposed scheme. We also found that

References (17)

C. Aggarwal et al., A new method for similarity indexing of market basket data, ACM SIGMOD (1999)
N. Beckmann et al., The R*-tree: an efficient and robust access method for points and rectangles, ACM SIGMOD (1990)
K. Beyer et al., When is “nearest neighbor” meaningful?, VLDB (1994)
A.W. Fu et al., Dynamic VP-tree indexing for n-nearest neighbor search given pair-wise distances, VLDB (2000)
A. Gionis et al., Efficient and tunable similar set retrieval, ACM SIGMOD (2001)
T. Hara, Efficient replica allocation in ad hoc networks for improving data accessibility, INFOCOM (2001)
H. Hu et al., Proactive caching for spatial queries in mobile environments, IEEE ICDE (2005)
N. Katayama et al., The SR-tree: an index structure for high-dimensional nearest neighbor queries, ACM SIGMOD (1997)

There are more references available in the full text version of this article.

