Predicting Pair Similarities for Near-Duplicate Detection in High Dimensional Spaces

Fisichella, Marco; Ceroni, Andrea; Deng, Fan; Nejdl, Wolfgang

doi:10.1007/978-3-319-10085-2_5

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8645)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

The problem of near-duplicate detection consists of finding those elements within a data set which are closest to a new input element, according to a given distance function and a given closeness threshold. Solving this problem for high-dimensional data sets is computationally expensive, since the amount of computation required to assess the similarity between any two elements increases with the number of dimensions. As a motivating example, an image or video sharing website would benefit from detecting near-duplicates whenever new multimedia content is uploaded. Among different approaches, near-duplicate detection in high-dimensional data sets has been effectively addressed by SimPair LSH [11]. Built on top of Locality Sensitive Hashing (LSH), SimPair LSH computes and stores a small set of near-duplicate pairs in advance, and uses them to prune the candidate set generated by LSH for a given new element. In this paper, we develop an algorithm to predict a lower bound on the number of elements pruned by SimPair LSH from the candidate set generated by LSH. Since the computational overhead that SimPair LSH incurs by computing near-duplicate pairs in advance is repaid by using those pairs to prune the candidate set, predicting the number of pruned points is crucial. The pruning prediction has been evaluated through experiments over three real-world data sets. We also performed further experiments on SimPair LSH, confirming that it consistently outperforms LSH with respect to memory space and running time.
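The pipeline summarized in the abstract (LSH proposes candidates, precomputed near-duplicate pairs prune some of them before any distance is computed) can be illustrated with a small sketch. The abstract does not spell out the pruning rule, so the triangle-inequality test below is an assumption, as are all names (`ToyLSH`, `precompute_pairs`, `near_duplicates`), the single hash table, and the brute-force pair computation. It is not the authors' implementation.

```python
import math
import random
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ToyLSH:
    """Hypothetical single-table LSH using k Gaussian random projections."""

    def __init__(self, dim, k=4, width=4.0, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.projections = [
            ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
            for _ in range(k)
        ]
        self.buckets = defaultdict(list)

    def key(self, point):
        # Each projection is quantized into slots of size `width`.
        return tuple(
            math.floor((sum(a * x for a, x in zip(vec, point)) + off) / self.width)
            for vec, off in self.projections
        )

    def add(self, idx, point):
        self.buckets[self.key(point)].append(idx)

    def candidates(self, point):
        # Candidate set: every stored id whose bucket key matches the query's.
        return list(self.buckets.get(self.key(point), []))


def precompute_pairs(points, pair_threshold):
    """Brute-force stand-in for the stored near-duplicate pairs (assumption)."""
    pairs = defaultdict(list)
    ids = sorted(points)
    for pos, i in enumerate(ids):
        for j in ids[pos + 1:]:
            d = euclidean(points[i], points[j])
            if d <= pair_threshold:
                pairs[i].append((j, d))
                pairs[j].append((i, d))
    return pairs


def near_duplicates(query, points, candidates, similar_pairs, threshold):
    """Return (matches, number of distance computations actually performed)."""
    matches, pruned, computed = [], set(), 0
    for idx in candidates:
        if idx in pruned:
            continue  # skipped without computing a distance
        d = euclidean(query, points[idx])
        computed += 1
        if d <= threshold:
            matches.append(idx)
            continue
        # Assumed rule: d(q, j) >= d(q, idx) - d(idx, j), so j cannot be
        # within the threshold if that lower bound already exceeds it.
        for j, d_pair in similar_pairs.get(idx, []):
            if d - d_pair > threshold:
                pruned.add(j)
    return matches, computed
```

In this sketch the benefit of the stored pairs shows up as the gap between the candidate set size and the number of distances computed; that gap is the quantity whose lower bound the paper sets out to predict.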

References

1. Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM 51(1) (2008)
2. Andoni, A., Indyk, P., Nguyen, H.L., Razenshteyn, I.: Beyond locality-sensitive hashing. CoRR, abs/1306.1547 (2013)
3. Andoni, A., Indyk, P., Patrascu, M.: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In: FOCS (2006)
4. Bahmani, B., Goel, A., Shinde, R.: Efficient distributed locality sensitive hashing. In: CIKM (2012)
5. Bawa, M., Condie, T., Ganesan, P.: LSH forest: self-tuning indexes for similarity search. In: WWW (2005)
6. Bellman, R.E.: Adaptive control processes - A guided tour (1961)
7. Chum, O., Philbin, J., Isard, M., Zisserman, A.: Scalable near identical image and shot detection. In: CIVR (2007)
8. Dasgupta, A., et al.: Fast locality-sensitive hashing. In: KDD (2011)
9. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: SCG (2004)
10. Deng, F.: Approximately detecting duplicates for streaming data using stable bloom filters. In: SIGMOD (2006)
11. Fisichella, M., Deng, F., Nejdl, W.: Efficient incremental near duplicate detection based on locality sensitive hashing. In: Bringas, P.G., Hameurlain, A., Quirchmayr, G. (eds.) DEXA 2010, Part I. LNCS, vol. 6261, pp. 152–166. Springer, Heidelberg (2010)
12. Gao, L., Wang, X.S.: Continuous similarity-based queries on streaming time series. IEEE Trans. on Knowl. and Data Eng. 17(10) (2005)
13. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: VLDB (1999)
14. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 3rd edn. (2006)
15. Haghani, P., Michel, S., Aberer, K.: Distributed similarity search in high dimensions using locality sensitive hashing. In: EDBT (2009)
16. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: STOC (1998)
17. Jacox, E.H., Samet, H.: Metric space similarity joins. ACM Trans. Database Syst. 33(2) (2008)
18. Koudas, N., Chin, B., Kian-lee, O., Zhang, T.R.: Approximate NN queries on streams with guaranteed error/performance bounds. In: VLDB (2004)
19. Lian, X., Chen, L.: Efficient similarity search over future stream time series. IEEE Trans. on Knowl. and Data Eng. 20(1) (2008)
20. Lv, Q., Josephson, W., Wang, Z., Charikar, M., Li, K.: Multi-probe LSH: efficient indexing for high-dimensional similarity search. In: VLDB (2007)
21. Teixeira, T., et al.: Scalable locality-sensitive hashing for similarity search in high-dimensional, large-scale multimedia datasets. CoRR, abs/1310.4136 (2013)
22. Torralba, A., Fergus, R., Freeman, W.: Tech. rep. MIT-CSAIL-TR-2007-024. Technical report, Massachusetts Institute of Technology (2007)
23. Weber, R., Schek, H.-J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: VLDB (1998)

Author information

Authors and Affiliations

L3S Research Center, Hannover, Germany
Marco Fisichella, Andrea Ceroni, Fan Deng & Wolfgang Nejdl

Editor information

Editors and Affiliations

Instituto Tecnológico de Informática, 46022, Valencia, Spain
Hendrik Decker
Faculty of Electrical Engineering, Department of Cybernetics, Czech Technical University in Prague, 166 27, Prague 6, Czech Republic
Lenka Lhotská
Department of Computer Science, The University of Auckland, 1010, Auckland, New Zealand
Sebastian Link
Knowledge Management, LMU University of Munich, Leopoldstraße 13, 80802, Munich, Germany
Marcus Spies
FAW, University of Linz, Altenbergerstrasse 69, 4040, Linz, Austria
Roland R. Wagner

Cite this paper

Fisichella, M., Ceroni, A., Deng, F., Nejdl, W. (2014). Predicting Pair Similarities for Near-Duplicate Detection in High Dimensional Spaces. In: Decker, H., Lhotská, L., Link, S., Spies, M., Wagner, R.R. (eds) Database and Expert Systems Applications. DEXA 2014. Lecture Notes in Computer Science, vol 8645. Springer, Cham. https://doi.org/10.1007/978-3-319-10085-2_5

DOI: https://doi.org/10.1007/978-3-319-10085-2_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10084-5
Online ISBN: 978-3-319-10085-2
eBook Packages: Computer Science, Computer Science (R0)