Abstract
Access methods are a fundamental tool in Information Retrieval. However, most of these methods suffer from the problem known as the curse of dimensionality when they are applied to objects with very high-dimensional representation spaces, such as text documents. In this paper we introduce a new parallel access method that uses several graphs as a distributed index structure, together with a kNN search algorithm. Two parallel versions of the search method are presented, one based on a master–slave scheme and the other based on a pipeline. A thorough experimental analysis on different datasets shows that our method can efficiently process large flows of queries, compete with other parallel algorithms, and at the same time obtain very high quality results.
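As a rough illustration of the master–slave idea mentioned above, the following minimal sketch distributes a query over several index partitions and merges the local answers into a global top-k list. It is not the authors' algorithm: each partition is searched exhaustively with cosine similarity as a stand-in for the graph-based search, threads stand in for distributed workers, and all names (`cosine`, `local_knn`, `master_knn`, the toy documents) are hypothetical.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def local_knn(partition, query, k):
    """Worker step: exhaustive top-k in one partition (assumed stand-in for a graph search)."""
    scored = ((cosine(vec, query), doc_id) for doc_id, vec in partition)
    return heapq.nlargest(k, scored)


def master_knn(partitions, query, k):
    """Master step: send the query to every partition, then merge the local top-k lists."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(lambda p: local_knn(p, query, k), partitions))
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [doc_id for _, doc_id in merged]


# Toy data: two partitions of (document id, term-weight vector) pairs.
partitions = [
    [("d1", {"graph": 1.0, "index": 1.0}), ("d2", {"gpu": 1.0})],
    [("d3", {"graph": 1.0}), ("d4", {"text": 1.0, "index": 1.0})],
]
result = master_knn(partitions, {"graph": 1.0, "index": 1.0}, k=2)
```

Because every worker returns its own k best candidates, the merged list contains the exact global top-k for this exhaustive stand-in; with an approximate per-partition search the merged result would be approximate as well.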





Acknowledgments
This research has been supported by the CICYT project TIN2014-53495-R of the Ministerio de Economía y Competitividad.
Cite this article
Artigas-Fuentes, F.J., Badía, J.M. Accessing very high dimensional spaces in parallel. J Supercomput 73, 176–189 (2017). https://doi.org/10.1007/s11227-016-1673-3
DOI: https://doi.org/10.1007/s11227-016-1673-3