DOI: 10.1145/2484028.2484077
research-article

Cache-conscious performance optimization for similarity search

Published: 28 July 2013

ABSTRACT

All-pairs similarity search can be implemented in two stages. The first stage partitions the data and groups potentially similar vectors. The second stage runs a set of tasks, where each task compares a partition of vectors with other candidate partitions. Because the data is sparse, accessing feature vectors in memory for runtime comparison in the second stage incurs significant overhead due to the memory hierarchy. This paper proposes a cache-conscious data layout and traversal optimization that reduces execution time through size-controlled data splitting and vector coalescing. It also provides an analysis to guide the choice of parameter settings. Our evaluation with several application datasets verifies the performance gains obtained by the optimization and shows that the proposed scheme is up to 2.74x as fast as the cache-oblivious baseline.
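
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the loop structure that "size-controlled data splitting" and "vector coalescing" suggest, not the authors' implementation. The names `compare_partitions`, `split_size`, and `block_size` are hypothetical, vectors are assumed to be sparse `{feature: weight}` dictionaries, and similarity is assumed to be a plain dot product. Python will not show the cache effect itself; the point is the traversal order, in which a small split of the task's own partition is reused against a coalesced block of visiting vectors.

```python
def dot(u, v):
    """Sparse dot product of two {feature: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v[f] for f, w in u.items() if f in v)


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def compare_partitions(own, others, threshold, split_size=2, block_size=2):
    """Return {(i, j): sim} for every own[i], others[j] with sim >= threshold.

    Assumption (not from the source): `own` is cut into splits of at most
    `split_size` vectors so that one split could stay cache-resident, and
    `others` is visited in coalesced blocks of at most `block_size` vectors.
    """
    results = {}
    own_indexed = list(enumerate(own))
    other_indexed = list(enumerate(others))
    for split in chunks(own_indexed, split_size):        # small working set
        for block in chunks(other_indexed, block_size):  # coalesced visitors
            for j, v in block:
                for i, u in split:
                    sim = dot(u, v)
                    if sim >= threshold:
                        results[(i, j)] = sim
    return results


own = [{"a": 1.0}, {"b": 1.0}, {"a": 0.6, "b": 0.8}]
others = [{"a": 1.0}, {"c": 1.0}]
pairs = compare_partitions(own, others, threshold=0.5)
```

The split and block sizes only change the order in which pairs are visited, so the set of reported pairs is the same for any positive values; in a compiled implementation these would be the tuning parameters that the paper's analysis is said to guide.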

      Published in

        SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
        July 2013
        1188 pages
        ISBN: 9781450320344
        DOI: 10.1145/2484028

        Copyright © 2013 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        SIGIR '13 paper acceptance rate: 73 of 366 submissions (20%)
        Overall acceptance rate: 792 of 3,983 submissions (20%)