
Partitioned Similarity Search with Cache-Conscious Data Traversal

Published: 14 April 2017

Abstract

All-pairs similarity search (APSS) is used in many web search and data mining applications. Previous work has used techniques such as comparison filtering, inverted indexing, and parallel accumulation of partial results. However, shuffling intermediate results can incur significant communication overhead as data scales up. This paper studies a scalable two-phase approach called Partition-based Similarity Search (PSS). The first phase partitions the data and groups vectors that are potentially similar. The second phase runs a set of tasks, each of which compares one partition of vectors with its candidate partitions. Because the data is sparse and memory is organized as a hierarchy, accessing feature vectors during the partition comparison phase incurs significant overhead. This paper introduces a cache-conscious design for data layout and traversal that reduces access time through size-controlled data splitting and vector coalescing, and it provides an analysis to guide the choice of optimization parameters. The evaluation results show that, for the tested datasets, the proposed approach eliminates unnecessary I/O and data communication early and sustains parallel efficiency, yielding a performance improvement of one order of magnitude. The approach can also be integrated with locality-sensitive hashing (LSH) for approximate APSS.
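The two-phase structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes sparse vectors stored as dicts with non-negative weights, L2-normalized so that the dot product is the cosine similarity, and integer vector ids. The names `max_weights`, `candidate_pairs`, `compare_partitions`, `pss`, `tau`, and `split_size` are hypothetical. The partition-level filter (a dot product of per-feature maximum weights) is an assumed stand-in for the paper's partitioning method, chosen only because it is a valid upper bound. The fixed-size splits mimic the shape of size-controlled splitting; plain Python will not show the cache effects.

```python
def dot(u, v):
    """Sparse dot product of two dict-encoded vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v[f] for f, w in u.items() if f in v)


def max_weights(partition, vectors):
    """Per-feature maximum weight over one partition (assumed summary)."""
    summary = {}
    for vid in partition:
        for f, w in vectors[vid].items():
            if w > summary.get(f, 0.0):
                summary[f] = w
    return summary


def candidate_pairs(partitions, vectors, tau):
    """Phase 1: keep only partition pairs that could hold a similar pair.

    With non-negative weights, dot(x, y) for x in P and y in Q is at most
    dot(max_weights(P), max_weights(Q)), so pairs below tau are skipped.
    """
    summaries = [max_weights(p, vectors) for p in partitions]
    pairs = []
    for i in range(len(partitions)):
        for j in range(i, len(partitions)):
            if i == j or dot(summaries[i], summaries[j]) >= tau:
                pairs.append((i, j))
    return pairs


def compare_partitions(part_a, part_b, vectors, tau, same, split_size):
    """Phase 2 task: compare part_a against part_b, one split at a time."""
    results = []
    for start in range(0, len(part_b), split_size):
        split = part_b[start:start + split_size]  # size-controlled split
        for a in part_a:
            for b in split:
                if same and a >= b:
                    continue  # within one partition, visit each pair once
                s = dot(vectors[a], vectors[b])
                if s >= tau:
                    results.append((min(a, b), max(a, b), s))
    return results


def pss(vectors, partitions, tau, split_size=2):
    """Run both phases and return all pairs with similarity >= tau."""
    out = []
    for i, j in candidate_pairs(partitions, vectors, tau):
        out.extend(compare_partitions(partitions[i], partitions[j],
                                      vectors, tau, i == j, split_size))
    return sorted(out)
```

In this sketch, each call to `compare_partitions` corresponds to one independent task, so no intermediate results need to be shuffled between tasks; only the final pairs are collected.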


Published In

ACM Transactions on Knowledge Discovery from Data, Volume 11, Issue 3
August 2017
372 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/3058790
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 April 2017
Accepted: 01 October 2016
Revised: 01 May 2016
Received: 01 June 2015
Published in TKDD Volume 11, Issue 3

Author Tags

  1. All-pairs similarity search
  2. data traversal
  3. memory hierarchy
  4. partitioning

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Center for Scientific Computing at CNSI/MRL
  • Kuwait University Scholarship
  • NSF

Cited By

  • (2020) An Efficient Method for Scientific Data Retrieval Service. Proceedings of the 3rd International Conference on Big Data Technologies, 6-10. DOI: 10.1145/3422713.3422731. Online publication date: 18-Sep-2020
  • (2020) Programming bsp and multi-bsp algorithms in ml. The Journal of Supercomputing 76:7, 5079-5097. DOI: 10.1007/s11227-019-02822-9. Online publication date: 1-Jul-2020
  • (2018) A case for richer cross-layer abstractions. Proceedings of the 45th Annual International Symposium on Computer Architecture, 207-220. DOI: 10.1109/ISCA.2018.00027. Online publication date: 2-Jun-2018
  • (2018) Combining Cache and Priority Queue to Enhance Evaluation of Similarity Search Queries. 2018 14th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), 956-963. DOI: 10.1109/FSKD.2018.8687208. Online publication date: Jul-2018
