Auto-tuning Similarity Search Algorithms on Multi-core Architectures

International Journal of Parallel Programming

Abstract

Large high-dimensional datasets have become ubiquitous. Video and image repositories, financial data, and sensor data are just a few examples of such datasets in practice. Many applications that use such datasets require the retrieval of data items similar to a given query item, or the nearest neighbors (NN or \(k\)-NN) of a given item. Another common query is the retrieval of multiple sets of nearest neighbors, i.e., multi \(k\)-NN, for different query items on the same data. With commodity multi-core CPUs becoming more widespread at lower cost, developing parallel algorithms for these search problems has become increasingly important. While the core nearest neighbor search problem is relatively easy to parallelize, tuning it for optimal performance is challenging, because the performance-specific algorithmic parameters, or “tuning knobs”, are inter-related and also depend on the data and query workloads. In this paper, we present (1) a detailed study of the various tuning knobs and their contributions to query throughput for parallelized versions of the two most common classes of high-dimensional multi-NN search algorithms: linear scan and tree traversal, and (2) an offline auto-tuner that sets these knobs by iteratively measuring actual query execution times for a given workload and dataset. We show experimentally that our auto-tuner reaches near-optimal performance and significantly outperforms un-tuned versions of parallel multi-NN algorithms for real video repository data on a variety of multi-core platforms.
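As a rough illustration of the two ideas in the abstract, the following Python sketch pairs a blocked linear-scan multi \(k\)-NN with a measurement-driven offline tuner. It is not the paper's implementation: the knob names `block_size` and `query_batch`, the function names, and the exhaustive grid search are all assumptions made for this example, and the scan is single-threaded for brevity.

```python
import heapq
import itertools
import random
import time


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) * (x - y) for x, y in zip(a, b))


def multi_knn_scan(data, queries, k, block_size, query_batch):
    """Blocked linear scan that answers k-NN for every query.

    block_size and query_batch are hypothetical tuning knobs: each block
    of data points is compared against a whole batch of queries before
    the scan moves on, so a block can be reused across queries.
    """
    # One bounded heap per query holding (-distance, -index); the root is
    # the current worst neighbor (largest distance, then largest index).
    heaps = [[] for _ in queries]
    for q0 in range(0, len(queries), query_batch):
        q_ids = range(q0, min(q0 + query_batch, len(queries)))
        for d0 in range(0, len(data), block_size):
            block = data[d0:d0 + block_size]
            for qi in q_ids:
                heap = heaps[qi]
                for off, point in enumerate(block):
                    item = (-sq_dist(queries[qi], point), -(d0 + off))
                    if len(heap) < k:
                        heapq.heappush(heap, item)
                    elif item > heap[0]:
                        heapq.heapreplace(heap, item)
    # Neighbor indices per query, nearest first (ties: smaller index first).
    return [[-ni for _, ni in sorted(h, reverse=True)] for h in heaps]


def autotune(data, queries, k, block_sizes, query_batches, repeats=3):
    """Offline tuner: time every knob setting on the given workload and
    keep the fastest one (exhaustive grid search, an assumed strategy)."""
    best = None
    for bs, qb in itertools.product(block_sizes, query_batches):
        elapsed = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            multi_knn_scan(data, queries, k, bs, qb)
            elapsed = min(elapsed, time.perf_counter() - start)
        if best is None or elapsed < best[0]:
            best = (elapsed, bs, qb)
    return {"seconds": best[0], "block_size": best[1], "query_batch": best[2]}


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[rng.random() for _ in range(16)] for _ in range(2000)]
    queries = [[rng.random() for _ in range(16)] for _ in range(20)]
    print(autotune(data, queries, k=5,
                   block_sizes=[64, 256, 1024], query_batches=[1, 4, 20]))
```

The tuner only ever looks at measured wall-clock time, so the chosen setting depends on the dataset, the query workload, and the machine it runs on. A grid search is the simplest such strategy; a larger knob space would likely call for a smarter search.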

Notes

  1. This is just an analogy, not to be confused with the SIMD instructions, like SSE, which we discuss later.

  2. Horizontal addition of a vector of \(4\) floats, which is an operation needed for distance computation via SIMD instructions, requires \(2\) SIMD instructions with SSE3, which is only \(2\) (\(<4\)) times better than the scalar version of the same computation.

  3. If the blocks are sufficiently small, this may still be advantageous if the index trees fit inside L2 or L3 cache.
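To make note 2 concrete, the sketch below emulates in plain Python the lane semantics of the SSE3 horizontal-add instruction (HADDPS) and shows why summing four floats takes two such instructions. The function names are hypothetical, and the code only models the data movement; it does not execute any SIMD instruction.

```python
def hadd_ps(a, b):
    """Emulate SSE3 HADDPS on two 4-float vectors: each output lane is
    the sum of one adjacent pair of input lanes."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]


def horizontal_sum(v):
    """Sum the four lanes of v using two horizontal adds."""
    step1 = hadd_ps(v, v)          # [v0+v1, v2+v3, v0+v1, v2+v3]
    step2 = hadd_ps(step1, step1)  # every lane now holds v0+v1+v2+v3
    return step2[0]


if __name__ == "__main__":
    print(horizontal_sum([1.0, 2.0, 3.0, 4.0]))
```

The first step reduces four values to two partial sums and the second reduces those to the total, which is why the reduction cannot finish in a single instruction.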

Author information

Corresponding author

Correspondence to Buğra Gedik.

About this article

Cite this article

Gedik, B. Auto-tuning Similarity Search Algorithms on Multi-core Architectures. Int J Parallel Prog 41, 595–620 (2013). https://doi.org/10.1007/s10766-013-0239-8

  • DOI: https://doi.org/10.1007/s10766-013-0239-8
