DOI: 10.1145/2938503.2938525
research-article

A Two-Phase MapReduce Algorithm for Scalable Preference Queries over High-Dimensional Data

Published: 11 July 2016

ABSTRACT

Preference (top-k) queries play a key role in modern data analytics tasks. Top-k techniques rely on a ranking function to determine an overall score for each object across all the relevant attributes being examined. This ranking function is provided by the user at query time, or generated for a particular user by a personalized search engine, which prevents pre-computation of the global scores. Executing this type of query is particularly challenging for high-dimensional data. Recently, bit-sliced indices (BSI) were proposed to answer these high-dimensional preference queries efficiently in a centralized environment.
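To make the bit-sliced index idea concrete, here is a minimal sketch in plain Python, assuming non-negative integer attribute values and using Python integers as uncompressed bitmaps. The function names (encode_bsi, add_bsi, decode_bsi) are hypothetical and chosen for illustration; this is not the authors' implementation, which is not shown on this page. Slice j stores bit j of every row's value, so two attributes can be summed for all rows at once with bitwise operations.

```python
from typing import List


def encode_bsi(values: List[int]) -> List[int]:
    """Encode a column of non-negative ints as bit slices.

    slices[j] is a bitmap (a Python int) whose bit i equals bit j of values[i].
    """
    n_bits = max(max(values).bit_length() if values else 0, 1)
    slices = []
    for j in range(n_bits):
        bitmap = 0
        for row, v in enumerate(values):
            if (v >> j) & 1:
                bitmap |= 1 << row
        slices.append(bitmap)
    return slices


def add_bsi(a: List[int], b: List[int]) -> List[int]:
    """Add two bit-sliced columns row-wise with a ripple-carry over slices."""
    result = []
    carry = 0
    for j in range(max(len(a), len(b))):
        x = a[j] if j < len(a) else 0
        y = b[j] if j < len(b) else 0
        result.append(x ^ y ^ carry)             # sum bit for every row
        carry = (x & y) | (carry & (x ^ y))      # carry bit for every row
    if carry:
        result.append(carry)
    return result


def decode_bsi(slices: List[int], n_rows: int) -> List[int]:
    """Recover the per-row integer values from the bit slices."""
    return [
        sum(((s >> row) & 1) << j for j, s in enumerate(slices))
        for row in range(n_rows)
    ]


# Two illustrative attribute columns over four rows (made-up data).
price = [3, 5, 2, 7]
rating = [4, 1, 6, 2]
total = add_bsi(encode_bsi(price), encode_bsi(rating))
scores = decode_bsi(total, n_rows=4)
```

The cost of the addition grows with the number of slices rather than the number of rows per operation, which is one reason a slice-based layout can suit queries that aggregate many attributes.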

As MapReduce and key-value stores proliferate as the preferred methods for analyzing big data, we set out to evaluate the performance of BSI in a distributed environment, in terms of index size, network traffic, and execution time of preference (top-k) queries over high-dimensional data. We implemented three MapReduce algorithms for processing aggregations and top-k queries over the BSI index: a baseline algorithm using a tree reduction of the slices, a group-slice algorithm, and an optimized two-phase algorithm that uses bit-slice mapping. The implementations run on top of Apache Spark and use vertical and horizontal data partitioning. The bit-slice mapping approach is shown to outperform the baseline MapReduce implementations because it uses a smaller index and offers better control over task granularity and load balancing.
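The general shape of a tree reduction over vertically partitioned data can be sketched as follows. This is a plain-Python stand-in, assuming each partition holds one attribute column as a list of integers rather than a BSI, and assuming a weighted-sum ranking function; it does not use Spark and is not the paper's baseline, group-slice, or two-phase algorithm. The names weighted_column, combine, tree_reduce, and top_k are hypothetical.

```python
import heapq
from typing import List, Sequence


def weighted_column(column: Sequence[int], weight: int) -> List[int]:
    """Map step: scale one attribute column by its query-time weight."""
    return [weight * v for v in column]


def combine(a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Reduce step: element-wise sum of two partial score vectors."""
    return [x + y for x, y in zip(a, b)]


def tree_reduce(parts: Sequence[Sequence[int]]) -> List[int]:
    """Combine partial results pairwise, level by level, like a binary tree."""
    if not parts:
        raise ValueError("tree_reduce needs at least one partition")
    level = [list(p) for p in parts]
    while len(level) > 1:
        nxt = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd partition is carried up to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]


def top_k(scores: Sequence[int], k: int) -> List[int]:
    """Return the row ids of the k highest scores, best first."""
    return heapq.nlargest(k, range(len(scores)), key=lambda r: scores[r])


# Three illustrative attribute columns (made-up data) and query-time weights.
columns = [[3, 5, 2, 7], [4, 1, 6, 2], [1, 1, 1, 1]]
weights = [2, 1, 1]
mapped = [weighted_column(c, w) for c, w in zip(columns, weights)]
scores = tree_reduce(mapped)
best_rows = top_k(scores, k=2)
```

With many attributes, pairwise combination keeps each reduce task small and lets independent pairs be processed in parallel, at the cost of one round per tree level.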

      • Published in

        IDEAS '16: Proceedings of the 20th International Database Engineering & Applications Symposium
        July 2016
        420 pages
        ISBN:9781450341189
        DOI:10.1145/2938503

        Copyright © 2016 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • research-article
        • Research
        • Refereed limited

        Acceptance Rates

        Overall Acceptance Rate: 74 of 210 submissions, 35%
