Partitioned Similarity Search with Cache-Conscious Data Traversal

Authors: Maha Alabduljalil, Tao Yang

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 11, Issue 3

Article No.: 34, Pages 1 - 38

https://doi.org/10.1145/3014060

Published: 14 April 2017

Abstract

All pairs similarity search (APSS) is used in many web search and data mining applications. Previous work has used techniques such as comparison filtering, inverted indexing, and parallel accumulation of partial results. However, shuffling intermediate results can incur significant communication overhead as data scales up. This paper studies a scalable two-phase approach called Partition-based Similarity Search (PSS). The first phase partitions the data and groups vectors that are potentially similar. The second phase runs a set of tasks, where each task compares a partition of vectors with other candidate partitions. Due to data sparsity and the presence of a memory hierarchy, accessing feature vectors during the partition comparison phase incurs significant overhead. This paper introduces a cache-conscious design for data layout and traversal that reduces access time through size-controlled data splitting and vector coalescing, and it provides an analysis to guide the choice of optimization parameters. The evaluation results show that, for the tested datasets, the proposed approach can lead to early elimination of unnecessary I/O and data communication while sustaining parallel efficiency, with one order of magnitude of performance improvement. It can also be integrated with LSH for approximated APSS.
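The two-phase structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made for illustration only: the function names (`cosine`, `make_partitions`, `compare_task`, `pss`), the use of cosine similarity over dictionary-based sparse vectors, the fixed-size partitioning by sorted identifier, and the `split_size` parameter that stands in for size-controlled data splitting. The sketch treats every later partition as a candidate, so it omits the grouping of potentially similar vectors and the vector coalescing that the abstract mentions.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def make_partitions(vectors, partition_size):
    """Phase 1 (simplified assumption): fixed-size partitions by sorted id."""
    ids = sorted(vectors)
    return [ids[i:i + partition_size] for i in range(0, len(ids), partition_size)]


def compare_task(vectors, own, candidates, threshold, split_size):
    """Phase 2: compare one partition with itself and with candidate partitions."""
    similar = set()
    # Pairs inside the task's own partition.
    for a, b in combinations(own, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            similar.add((a, b))
    # Process the own partition in small splits, so that each split can stay
    # in fast memory while the candidate vectors are streamed past it.
    for start in range(0, len(own), split_size):
        split = own[start:start + split_size]
        for other in candidates:
            for b in other:
                for a in split:
                    if cosine(vectors[a], vectors[b]) >= threshold:
                        similar.add((a, b))
    return similar


def pss(vectors, threshold, partition_size=2, split_size=1):
    """Run one task per partition; candidates are all later partitions (assumption)."""
    parts = make_partitions(vectors, partition_size)
    results = set()
    for i, own in enumerate(parts):
        results |= compare_task(vectors, own, parts[i + 1:], threshold, split_size)
    return results


if __name__ == "__main__":
    docs = {
        "a": {"x": 1.0, "y": 1.0},
        "b": {"x": 1.0, "y": 1.0},
        "c": {"z": 1.0},
        "d": {"x": 1.0},
    }
    print(sorted(pss(docs, threshold=0.9)))  # → [('a', 'b')]
```

Because each task only reads its own partition and its candidate partitions, the tasks are independent and need no shuffling of intermediate results. In this sketch the split size does not change the result, only the order in which vector pairs are visited.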

Index Terms

Partitioned Similarity Search with Cache-Conscious Data Traversal
1. Information systems
  1. Information retrieval
    1. Search engine architectures and scalability
      1. Distributed retrieval
      2. Peer-to-peer retrieval
  2. Information storage systems
    1. Storage architectures
      1. Distributed storage

Information

Published In

ACM Transactions on Knowledge Discovery from Data Volume 11, Issue 3

August 2017

372 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/3058790

Editor:
Philip S. Yu
University of Illinois at Chicago, USA

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 April 2017

Accepted: 01 October 2016

Revised: 01 May 2016

Received: 01 June 2015

Published in TKDD Volume 11, Issue 3

Qualifiers

Research-article
Research
Refereed

Funding Sources

Center for Scientific Computing at CNSI/MRL
Kuwait University Scholarship
NSF

Bibliometrics

Total Citations: 4
Total Downloads: 273
Downloads (last 12 months): 70
Downloads (last 6 weeks): 15

Reflects downloads up to 13 Feb 2025

Citations

Cited By

Du L, Li M, Xu J. 2020. An Efficient Method for Scientific Data Retrieval Service. In Proceedings of the 3rd International Conference on Big Data Technologies. 6--10. https://dl.acm.org/doi/10.1145/3422713.3422731. Online publication date: 18-Sep-2020.

Allombert V, Gava F. 2020. Programming bsp and multi-bsp algorithms in ml. The Journal of Supercomputing 76, 7, 5079--5097. https://dl.acm.org/doi/10.1007/s11227-019-02822-9. Online publication date: 1-Jul-2020.

Vijaykumar N, Jain A, Majumdar D, Hsieh K, Pekhimenko G, Ebrahimi E, Hajinazar N, Gibbons P, Mutlu O. 2018. A case for richer cross-layer abstractions. In Proceedings of the 45th Annual International Symposium on Computer Architecture. 207--220. https://dl.acm.org/doi/10.1109/ISCA.2018.00027. Online publication date: 2-Jun-2018.

Nalepa F, Batko M, Zezula P. 2018. Combining Cache and Priority Queue to Enhance Evaluation of Similarity Search Queries. In 2018 14th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD). 956--963. https://doi.org/10.1109/FSKD.2018.8687208. Online publication date: Jul-2018.
