Distributed similarity estimation using derived dimensions

  • Regular Paper
The VLDB Journal

Abstract

Computing the similarity between data objects is a fundamental operation in many distributed applications, such as those on the World Wide Web, in peer-to-peer networks, and in sensor networks. We provide a framework based on Random Hyperplane Projection (RHP) that permits continuous computation of similarity estimates, using either the cosine similarity or the correlation coefficient as the similarity metric, between data descriptions that are streamed from remote sites. These estimates are computed at a monitoring node without transmitting the actual data values. The original RHP framework is data agnostic and works for arbitrary data sets. However, data in most applications is not uniform. We first describe the shortcomings of the RHP scheme, in particular its inability to exploit evident skew in the underlying data distribution. We then propose a novel framework that automatically detects correlations and computes an RHP embedding in the Hamming cube that is tailored to the provided data set, using the idea of derived dimensions, which we introduce in this work. We further discuss extensions of our framework that cope with changes in the data distribution. In such cases, our technique automatically reverts to the basic RHP model for data items that cannot be described accurately through the computed embedding. Our experimental evaluation on several real and synthetic data sets demonstrates that the proposed scheme outperforms the existing RHP algorithm and previously proposed alternative techniques, providing significantly more accurate similarity estimates using the same number of bits.
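To make the baseline concrete, the following is a minimal sketch of basic, data-agnostic RHP as it is commonly described: each remote site reduces its vector to a bit signature by recording the sign of its dot product with shared random Gaussian hyperplanes, and the monitoring node estimates the angle between two vectors from the fraction of differing bits. It does not implement the derived-dimensions embedding proposed in the article. All function names, the shared-seed convention, and the example vectors are assumptions made for illustration only.

```python
import math
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplane normals of dimension dim.

    Assumption: every site derives the same hyperplanes from a shared seed.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def rhp_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return [
        1 if sum(r_i * v_i for r_i, v_i in zip(r, vector)) >= 0 else 0
        for r in hyperplanes
    ]


def hamming_distance(sig_a, sig_b):
    """Number of positions in which the two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


def estimate_cosine(sig_a, sig_b):
    """Estimate cosine similarity from signatures alone.

    A random hyperplane separates two vectors with probability theta / pi,
    so theta is estimated as pi * (fraction of differing bits).
    """
    theta = math.pi * hamming_distance(sig_a, sig_b) / len(sig_a)
    return math.cos(theta)


def center(vector):
    """Subtract the mean; the cosine of centered vectors is the correlation."""
    mean = sum(vector) / len(vector)
    return [x - mean for x in vector]


def exact_cosine(u, v):
    """Exact cosine similarity, used here only to check the estimate."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    planes = make_hyperplanes(dim=3, n_bits=2048, seed=7)
    u, v = [1.0, 2.0, 3.0], [2.0, 1.0, 0.5]
    # Each site would transmit only its signature, never the raw values.
    sig_u, sig_v = rhp_signature(u, planes), rhp_signature(v, planes)
    print("estimated:", estimate_cosine(sig_u, sig_v))
    print("exact:    ", exact_cosine(u, v))
```

To estimate the correlation coefficient instead of the cosine similarity, each site would pass `center(vector)` to `rhp_signature`. The accuracy of the estimate depends only on the number of bits, not on the data, which is the data-agnostic behaviour the article sets out to improve.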

Author information

Corresponding author

Correspondence to Konstantinos Georgoulas.

About this article

Cite this article

Georgoulas, K., Kotidis, Y. Distributed similarity estimation using derived dimensions. The VLDB Journal 21, 25–50 (2012). https://doi.org/10.1007/s00778-011-0233-y

  • DOI: https://doi.org/10.1007/s00778-011-0233-y
