Distributed similarity estimation using derived dimensions

Georgoulas, Konstantinos; Kotidis, Yannis

doi:10.1007/s00778-011-0233-y

Regular Paper
Published: 22 April 2011

Volume 21, pages 25–50, (2012)

The VLDB Journal


Abstract

Computing the similarity between data objects is a fundamental operation for many distributed applications, such as those on the World Wide Web, in peer-to-peer networks, or in sensor networks. We provide a framework based on Random Hyperplane Projection (RHP) that permits continuous computation of similarity estimates (using the cosine similarity or the correlation coefficient as the similarity metric) between data descriptions that are streamed from remote sites. These estimates are computed at a monitoring node, without transmitting the actual data values. The original RHP framework is data agnostic and works for arbitrary data sets. However, data in most applications is not uniform. We first describe the shortcomings of the RHP scheme, in particular its inability to exploit evident skew in the underlying data distribution. We then propose a novel framework that automatically detects correlations and computes an RHP embedding in the Hamming cube tailored to the provided data set, using the idea of derived dimensions, which we introduce. We further discuss extensions of our framework that cope with changes in the data distribution. In such cases, our technique automatically reverts to the basic RHP model for data items that cannot be described accurately through the computed embedding. Our experimental evaluation using several real and synthetic data sets demonstrates that our proposed scheme outperforms the existing RHP algorithm and alternative techniques that have been proposed, providing significantly more accurate similarity computations using the same number of bits.
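
As an illustration of the baseline that the abstract refers to, the following is a minimal sketch of the basic, data-agnostic RHP scheme. It is not the derived-dimensions method of the paper. Each site projects its vector onto a set of random hyperplanes and keeps only the sign bits; the monitoring node then estimates the cosine similarity from the Hamming distance between two bit signatures, using the fact that two vectors at angle θ fall on opposite sides of a random hyperplane with probability θ/π. All function names, the shared seed, and the sample vectors are assumptions made for this example.

```python
import math
import random


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components.

    Assumption: all sites use the same seed, so they generate identical
    hyperplanes without exchanging them.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def rhp_signature(vec, hyperplanes):
    """Return one bit per hyperplane: 1 if vec lies on its non-negative side."""
    return [
        1 if sum(r * x for r, x in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    ]


def estimate_cosine(sig_a, sig_b):
    """Estimate cosine similarity from two equal-length bit signatures.

    The fraction of differing bits estimates theta / pi, where theta is
    the angle between the original vectors.
    """
    hamming = sum(a != b for a, b in zip(sig_a, sig_b))
    theta = math.pi * hamming / len(sig_a)
    return math.cos(theta)


def exact_cosine(u, v):
    """Exact cosine similarity, used here only to check the estimate."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical measurements held at two remote sites.
    u = [1.0, 2.0, 3.0, 4.0]
    v = [2.0, 1.0, 4.0, 3.0]

    planes = make_hyperplanes(n_bits=4096, dim=len(u), seed=42)

    # Each site transmits only its bit signature, never the raw values.
    sig_u = rhp_signature(u, planes)
    sig_v = rhp_signature(v, planes)

    print("estimated cosine:", estimate_cosine(sig_u, sig_v))
    print("exact cosine:    ", exact_cosine(u, v))
```

For the sample vectors the exact cosine is 28/30 ≈ 0.933; the estimate approaches this value as the number of bits grows. To estimate the correlation coefficient instead, each site can subtract the mean of its vector before computing the signature, since the correlation coefficient equals the cosine similarity of the mean-centered vectors.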



Author information

Authors and Affiliations

Athens University of Economics and Business, 76 Patission Street, Athens, Greece
Konstantinos Georgoulas & Yannis Kotidis


Corresponding author

Correspondence to Konstantinos Georgoulas.


About this article

Cite this article

Georgoulas, K., Kotidis, Y. Distributed similarity estimation using derived dimensions. The VLDB Journal 21, 25–50 (2012). https://doi.org/10.1007/s00778-011-0233-y


Received: 08 August 2010
Revised: 12 March 2011
Accepted: 06 April 2011
Published: 22 April 2011
Issue Date: February 2012
DOI: https://doi.org/10.1007/s00778-011-0233-y
