Abstract
Computing the similarity between data objects is a fundamental operation for many distributed applications, such as those on the World Wide Web, in peer-to-peer networks, or in sensor networks. We provide a framework based on Random Hyperplane Projection (RHP) that permits continuous computation of similarity estimates (using the cosine similarity or the correlation coefficient as the similarity metric) between data descriptions that are streamed from remote sites. These estimates are computed at a monitoring node without transmitting the actual data values. The original RHP framework is data agnostic and works for arbitrary data sets. However, data in most applications is not uniform. We first describe the shortcomings of the RHP scheme, in particular its inability to exploit evident skew in the underlying data distribution. We then introduce the idea of derived dimensions and use it in a novel framework that automatically detects correlations and computes an RHP embedding in the Hamming cube tailored to the provided data set. We further discuss extensions of our framework that cope with changes in the data distribution: in such cases, our technique automatically reverts to the basic RHP model for data items that cannot be described accurately through the computed embedding. Our experimental evaluation on several real and synthetic data sets demonstrates that the proposed scheme outperforms the existing RHP algorithm and previously proposed alternative techniques, providing significantly more accurate similarity computations using the same number of bits.
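To make the basic RHP scheme mentioned above concrete, the following is a minimal sketch of the standard, data-agnostic variant only; it does not implement the derived-dimensions framework proposed in the article. Each vector is reduced to a bit signature by recording on which side of each random hyperplane it falls, and the fraction of differing bits between two signatures approximates the angle between the vectors divided by pi. The function names, the Gaussian sampling of hyperplane normals, and the fixed seed (so that remote sites and the monitoring node are assumed to share the same hyperplanes) are illustrative assumptions, not details taken from the article.

```python
import math
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals with i.i.d. Gaussian entries.

    Assumption: all sites use the same seed, so they share the hyperplanes.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def rhp_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return [
        1 if sum(r * x for r, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    ]


def estimate_cosine(sig_a, sig_b):
    """Estimate cosine similarity from the Hamming distance of two signatures.

    The probability that one bit differs equals angle / pi, so the angle is
    estimated as pi * (differing bits / total bits).
    """
    hamming = sum(a != b for a, b in zip(sig_a, sig_b))
    theta = math.pi * hamming / len(sig_a)
    return math.cos(theta)


def exact_cosine(u, v):
    """Reference value computed from the raw vectors, for comparison only."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    planes = make_hyperplanes(num_bits=2048, dim=2)
    u, v = [1.0, 0.0], [1.0, 1.0]
    # Only the signatures would need to be sent to the monitoring node.
    est = estimate_cosine(rhp_signature(u, planes), rhp_signature(v, planes))
    print("estimated:", est, "exact:", exact_cosine(u, v))
```

In this sketch only the bit signatures travel to the monitoring node, which matches the setting described in the abstract. The correlation coefficient can be estimated the same way by subtracting each vector's mean from its entries before computing the signature, since the correlation coefficient is the cosine similarity of the mean-centered vectors.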
Cite this article
Georgoulas, K., Kotidis, Y. Distributed similarity estimation using derived dimensions. The VLDB Journal 21, 25–50 (2012). https://doi.org/10.1007/s00778-011-0233-y