Predicting Pair Similarities for Near-Duplicate Detection in High Dimensional Spaces

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8645)

Abstract

The problem of near-duplicate detection consists in finding those elements within a data set that are closest to a new input element, according to a given distance function and a given closeness threshold. Solving this problem for high-dimensional data sets is computationally expensive, since the amount of computation required to assess the similarity between any two elements increases with the number of dimensions. As a motivating example, an image or video sharing website would benefit from detecting near-duplicates whenever new multimedia content is uploaded. Among different approaches, near-duplicate detection in high-dimensional data sets has been effectively addressed by SimPair LSH [11]. Built on top of Locality Sensitive Hashing (LSH), SimPair LSH computes and stores a small set of near-duplicate pairs in advance, and uses them to prune the candidate set generated by LSH for a given new element. In this paper, we develop an algorithm to predict a lower bound on the number of elements pruned by SimPair LSH from the candidate set generated by LSH. Since the computational overhead that SimPair LSH incurs to compute near-duplicate pairs in advance is repaid only by using that information to prune the candidate set, predicting the number of pruned points is crucial. The pruning prediction has been evaluated through experiments over three real-world data sets. We also performed further experiments on SimPair LSH, confirming that it consistently outperforms LSH with respect to memory space and running time.
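The idea of pruning an LSH candidate set with precomputed pairs can be illustrated with a minimal sketch. The abstract does not state the pruning rule, so the triangle-inequality test below is an assumption, and the names `near_duplicates`, `similar_pairs`, and `tau` are hypothetical. It is not the authors' implementation, and the candidate list is supplied by hand rather than produced by real LSH tables.

```python
import math


def near_duplicates(query, candidates, points, similar_pairs, tau):
    """Return (candidate ids within tau of query, number of candidates pruned).

    similar_pairs[i][j] holds a precomputed distance between stored points
    i and j (assumed layout). The pruning rule is an assumption based on the
    triangle inequality: dist(query, j) >= dist(query, i) - dist(i, j).
    """
    candidate_set = set(candidates)
    result, pruned, checked = [], set(), set()
    for cid in candidates:
        if cid in pruned:
            continue  # distance computation skipped thanks to a stored pair
        checked.add(cid)
        d = math.dist(query, points[cid])
        if d <= tau:
            result.append(cid)
        for other, d_pair in similar_pairs.get(cid, {}).items():
            # If even the lower bound exceeds tau, `other` cannot qualify.
            if (other in candidate_set and other not in checked
                    and d - d_pair > tau):
                pruned.add(other)
    return result, len(pruned)


# Toy data: points 1 and 2 are a stored near-duplicate pair far from the query.
points = {0: (0.0, 0.0), 1: (10.0, 0.0), 2: (10.5, 0.0), 3: (0.2, 0.0)}
similar_pairs = {1: {2: 0.5}, 2: {1: 0.5}}
found, n_pruned = near_duplicates((0.0, 0.1), [0, 1, 2, 3], points,
                                  similar_pairs, tau=1.0)
```

In this toy run, measuring the distance to point 1 is enough to discard point 2 without computing its distance. The number of candidates discarded this way is the quantity for which the paper predicts a lower bound.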

References

  1. Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM 51(1) (2008)

  2. Andoni, A., Indyk, P., Nguyen, H.L., Razenshteyn, I.: Beyond locality-sensitive hashing. CoRR, abs/1306.1547 (2013)

  3. Andoni, A., Indyk, P., Patrascu, M.: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In: FOCS (2006)

  4. Bahmani, B., Goel, A., Shinde, R.: Efficient distributed locality sensitive hashing. In: CIKM (2012)

  5. Bawa, M., Condie, T., Ganesan, P.: LSH forest: self-tuning indexes for similarity search. In: WWW (2005)

  6. Bellman, R.E.: Adaptive control processes - A guided tour (1961)

  7. Chum, O., Philbin, J., Isard, M., Zisserman, A.: Scalable near identical image and shot detection. In: CIVR (2007)

  8. Dasgupta, A., et al.: Fast locality-sensitive hashing. In: KDD (2011)

  9. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: SCG (2004)

  10. Deng, F.: Approximately detecting duplicates for streaming data using stable bloom filters. In: SIGMOD (2006)

  11. Fisichella, M., Deng, F., Nejdl, W.: Efficient incremental near duplicate detection based on locality sensitive hashing. In: Bringas, P.G., Hameurlain, A., Quirchmayr, G. (eds.) DEXA 2010, Part I. LNCS, vol. 6261, pp. 152–166. Springer, Heidelberg (2010)

  12. Gao, L., Wang, X.S.: Continuous similarity-based queries on streaming time series. IEEE Trans. on Knowl. and Data Eng. 17(10) (2005)

  13. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: VLDB (1999)

  14. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 3rd edn. (2006)

  15. Haghani, P., Michel, S., Aberer, K.: Distributed similarity search in high dimensions using locality sensitive hashing. In: EDBT (2009)

  16. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: STOC (1998)

  17. Jacox, E.H., Samet, H.: Metric space similarity joins. ACM Trans. Database Syst. 33(2) (2008)

  18. Koudas, N., Chin, B., Kian-lee, O., Zhang, T.R.: Approximate NN queries on streams with guaranteed error/performance bounds. In: VLDB (2004)

  19. Lian, X., Chen, L.: Efficient similarity search over future stream time series. IEEE Trans. on Knowl. and Data Eng. 20(1) (2008)

  20. Lv, Q., Josephson, W., Wang, Z., Charikar, M., Li, K.: Multi-probe LSH: efficient indexing for high-dimensional similarity search. In: VLDB (2007)

  21. Teixeira, T., et al.: Scalable locality-sensitive hashing for similarity search in high-dimensional, large-scale multimedia datasets. CoRR, abs/1310.4136 (2013)

  22. Torralba, A., Fergus, R., Freeman, W.: Tech. rep. MIT-CSAIL-TR-2007-024. Technical report, Massachusetts Institute of Technology (2007)

  23. Weber, R., Schek, H.-J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: VLDB (1998)

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Fisichella, M., Ceroni, A., Deng, F., Nejdl, W. (2014). Predicting Pair Similarities for Near-Duplicate Detection in High Dimensional Spaces. In: Decker, H., Lhotská, L., Link, S., Spies, M., Wagner, R.R. (eds) Database and Expert Systems Applications. DEXA 2014. Lecture Notes in Computer Science, vol 8645. Springer, Cham. https://doi.org/10.1007/978-3-319-10085-2_5

  • DOI: https://doi.org/10.1007/978-3-319-10085-2_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-10084-5

  • Online ISBN: 978-3-319-10085-2

  • eBook Packages: Computer Science, Computer Science (R0)
