A graph-based cache for large-scale similarity search engines

Gil-Costa, Veronica; Marin, Mauricio; Bonacic, Carolina; Solar, Roberto

doi:10.1007/s11227-017-2207-3

Published: 07 December 2017

Volume 74, pages 2006–2034, (2018)

The Journal of Supercomputing

Veronica Gil-Costa^1,2,
Mauricio Marin^3,4,
Carolina Bonacic^3,4 &
Roberto Solar^5


Abstract

Large-scale similarity search engines are complex systems devised to process unstructured data such as images and videos. These systems are deployed on clusters of distributed processors connected through high-speed networks. To process a new query, a distance function is evaluated between the query and the objects stored in the database. This process relies on a metric space index distributed among the processors. In this paper, we propose a cache-based strategy that uses pre-computed information to reduce the number of computations required to retrieve the top-k results for user queries. Our proposal executes an approximate similarity search algorithm that takes advantage of the links between objects stored in the cache memory. Those links form a graph of similarity among pre-computed queries. Compared to previous methods in the literature, the proposed approach reduces the number of distance evaluations by up to 60%.
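
The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that cached queries are nodes of a graph, that each new cached query is linked to its nearest cached neighbours, and that a new query is answered approximately by walking the graph greedily and re-ranking the results stored at the node reached. The class `GraphCache`, the parameter `max_links`, the helper `exact_topk`, and the toy data are all hypothetical names introduced for illustration.

```python
import heapq
import math


def euclidean(a, b):
    """Example metric; any distance function over the objects would do."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class GraphCache:
    """Hypothetical cache: past queries are nodes, edges link similar queries."""

    def __init__(self, dist=euclidean, max_links=2):
        self.dist = dist
        self.max_links = max_links
        self.queries = []     # cached query objects
        self.results = []     # results[i]: top-k objects stored for queries[i]
        self.links = []       # links[i]: indices of neighbouring cached queries
        self.evaluations = 0  # number of distance evaluations performed

    def _d(self, a, b):
        self.evaluations += 1
        return self.dist(a, b)

    def insert(self, query, topk):
        """Cache a query with its results and link it to its nearest cached queries."""
        new = len(self.queries)
        nearest = heapq.nsmallest(
            self.max_links, range(new),
            key=lambda i: self._d(query, self.queries[i]))
        self.queries.append(query)
        self.results.append(list(topk))
        self.links.append(set(nearest))
        for i in nearest:
            self.links[i].add(new)

    def search(self, query, k):
        """Approximate top-k: greedy walk over cached queries, then re-rank."""
        if not self.queries:
            return []
        known = {0: self._d(query, self.queries[0])}  # node 0 is the entry point
        current = 0
        while True:
            for n in self.links[current]:
                if n not in known:
                    known[n] = self._d(query, self.queries[n])
            best = min(self.links[current], key=known.get, default=current)
            if known[best] >= known[current]:
                break  # no neighbour is closer to the query: stop walking
            current = best
        # Candidates: results of the closest cached query and of its neighbours.
        # Objects must be hashable (tuples here) so duplicates can be dropped.
        candidates = set(self.results[current])
        for n in self.links[current]:
            candidates.update(self.results[n])
        return sorted(candidates, key=lambda obj: self._d(query, obj))[:k]


# Toy database of 20 points on a line (assumed data, for illustration only).
database = [(float(i), 0.0) for i in range(20)]


def exact_topk(query, k):
    """Brute-force baseline: one distance evaluation per database object."""
    return sorted(database, key=lambda obj: euclidean(query, obj))[:k]


cache = GraphCache()
for past_query in [(2.2, 0.0), (9.7, 0.0), (16.1, 0.0)]:
    cache.insert(past_query, exact_topk(past_query, 3))

before = cache.evaluations
hits = cache.search((10.2, 0.0), k=2)
used = cache.evaluations - before
print(hits, used, "evaluations vs", len(database), "for brute force")
```

On this toy data the cached answer coincides with the exact one while evaluating fewer distances than a full scan. In general the result is approximate, because only objects already stored in the cache can be returned.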


Acknowledgements

This research was supported by the supercomputing infrastructure of the NLHPC Chile, partially funded by CONICYT Basal Funds FB0001, Fondef ID15I10560, and partially funded by PICT 2014 N 2014-01146.

Author information

Authors and Affiliations

Universidad Nacional de San Luis, San Luis, Argentina
Veronica Gil-Costa
CONICET, San Luis, Argentina
Veronica Gil-Costa
CeBiB, Centre for Biotechnology and Bioengineering, Santiago, Chile
Mauricio Marin & Carolina Bonacic
DIINF, Universidad de Santiago de Chile, Santiago, Chile
Mauricio Marin & Carolina Bonacic
CITIAPS, Universidad de Santiago de Chile, Santiago, Chile
Roberto Solar

Corresponding author

Correspondence to Veronica Gil-Costa.

About this article

Cite this article

Gil-Costa, V., Marin, M., Bonacic, C. et al. A graph-based cache for large-scale similarity search engines. J Supercomput 74, 2006–2034 (2018). https://doi.org/10.1007/s11227-017-2207-3

Issue Date: May 2018
DOI: https://doi.org/10.1007/s11227-017-2207-3
