ABSTRACT
All-pairs similarity search can be implemented in two stages. The first stage partitions the data and groups potentially similar vectors. The second stage runs a set of tasks, each of which compares one partition of vectors with other candidate partitions. Because the data is sparse, accessing feature vectors in memory for runtime comparison in the second stage incurs significant overhead from the memory hierarchy. This paper proposes a cache-conscious data layout and traversal optimization that reduces execution time through size-controlled data splitting and vector coalescing. It also provides an analysis to guide the choice of parameter settings. Our evaluation with several application datasets verifies the performance gains obtained by the optimization and shows that the proposed scheme is up to 2.74x as fast as the cache-oblivious baseline.
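The second-stage traversal described above can be sketched as follows. This is only an illustrative reading of "size-controlled data splitting", not the paper's implementation: the function names, the `split_size` parameter, the dict-based sparse vector format, and the use of a dot product as the similarity score are all assumptions made for the example. Python gives no direct control over cache placement, so the sketch shows the traversal order only: a task's own partition is cut into small splits, each split is indexed once, and the vectors of the other partitions are streamed against that small index.

```python
from collections import defaultdict
from itertools import islice


def splits(vectors, size):
    """Yield consecutive chunks of at most `size` (id, vector) pairs."""
    it = iter(vectors)
    while chunk := list(islice(it, size)):
        yield chunk


def build_inverted_index(chunk):
    """Map each feature to the (vector_id, weight) pairs that contain it."""
    index = defaultdict(list)
    for vid, vec in chunk:
        for feature, weight in vec.items():
            index[feature].append((vid, weight))
    return index


def compare_partitions(own, others, threshold, split_size=2):
    """Return {(own_id, other_id): score} for pairs scoring >= threshold.

    `own` and `others` are lists of (id, {feature: weight}) pairs.
    `split_size` (hypothetical parameter) bounds how many of the task's own
    vectors are indexed at a time, so the working set stays small.
    """
    results = {}
    for chunk in splits(own, split_size):
        index = build_inverted_index(chunk)
        for oid, ovec in others:
            # Accumulate partial dot products against the current split only.
            scores = defaultdict(float)
            for feature, weight in ovec.items():
                for vid, w in index.get(feature, ()):
                    scores[vid] += w * weight
            for vid, score in scores.items():
                if score >= threshold:
                    results[(vid, oid)] = score
    return results
```

The split size changes only the order in which memory is touched, not the result, so in a sketch like this it can be tuned freely against the cache size of the target machine.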
Index Terms
- Cache-conscious performance optimization for similarity search