DOI: 10.1145/2938503.2938525
research-article

A Two-Phase MapReduce Algorithm for Scalable Preference Queries over High-Dimensional Data

Published: 11 July 2016

Abstract

Preference (top-k) queries play a key role in modern data analytics tasks. Top-k techniques rely on a ranking function to determine an overall score for each object across all the relevant attributes being examined. This ranking function is provided by the user at query time, or generated for a particular user by a personalized search engine, which prevents pre-computation of the global scores. Executing this type of query is particularly challenging for high-dimensional data. Recently, bit-sliced indices (BSI) were proposed to answer these high-dimensional preference queries efficiently in a centralized environment.
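To make the idea concrete, the following is a minimal, centralized sketch of a bit-sliced index in plain Python. It is an illustration under stated assumptions, not the implementation evaluated in this paper: each slice is stored as a Python integer whose bit r belongs to row r, the weights are small non-negative integers, and the helper names (`build_bsi`, `add_bsi`, `scale_bsi`, `top_k`, `decode_bsi`) are hypothetical. Scores are computed with slice-wise adder logic, and the top-k rows are found by scanning the score slices from the most significant one down.

```python
from typing import List

BSI = List[int]  # slices[j] has bit r set iff bit j of row r's value is 1


def build_bsi(values: List[int]) -> BSI:
    """Encode a column of non-negative integers as bit slices."""
    n_bits = max(values, default=0).bit_length()
    return [sum(((v >> j) & 1) << r for r, v in enumerate(values))
            for j in range(n_bits)]


def add_bsi(a: BSI, b: BSI) -> BSI:
    """Row-wise addition of two BSIs using a ripple-carry adder over slices."""
    out, carry = [], 0
    for j in range(max(len(a), len(b))):
        x = a[j] if j < len(a) else 0
        y = b[j] if j < len(b) else 0
        out.append(x ^ y ^ carry)
        carry = (x & y) | (carry & (x ^ y))
    if carry:
        out.append(carry)
    return out


def scale_bsi(a: BSI, w: int) -> BSI:
    """Multiply every row by a non-negative integer weight (shift and add)."""
    out: BSI = []
    shift = 0
    while w:
        if w & 1:
            out = add_bsi(out, [0] * shift + a)
        w >>= 1
        shift += 1
    return out


def decode_bsi(slices: BSI, n_rows: int) -> List[int]:
    """Recover the per-row integers (used here only to check results)."""
    return [sum(((s >> r) & 1) << j for j, s in enumerate(slices))
            for r in range(n_rows)]


def top_k(slices: BSI, n_rows: int, k: int) -> List[int]:
    """Return row ids of the k largest values; ties at the cutoff are all kept."""
    if k <= 0:
        return []
    greater, equal = 0, (1 << n_rows) - 1  # rows surely in / still undecided
    for s in reversed(slices):
        candidate = greater | (equal & s)
        n = bin(candidate).count("1")
        if n > k:
            equal &= s
        elif n < k:
            greater, equal = candidate, equal & ~s
        else:
            equal &= s
            break
    result = greater | equal
    return [r for r in range(n_rows) if (result >> r) & 1]


# Two attributes over four rows, ranked by score = attr1 + 2 * attr2.
attr1 = build_bsi([3, 7, 2, 5])
attr2 = build_bsi([4, 1, 6, 5])
score = add_bsi(attr1, scale_bsi(attr2, 2))
print(decode_bsi(score, 4))  # [11, 9, 14, 15]
print(top_k(score, 4, 2))    # [2, 3]
```

Because the weights enter only at query time, the per-attribute slices can be built once and reused for any ranking function, which is the property that makes the index attractive when global scores cannot be pre-computed.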
As MapReduce and key-value stores proliferate as the preferred methods for analyzing big data, we set out to evaluate the performance of BSI in a distributed environment in terms of index size, network traffic, and execution time of preference (top-k) queries over high-dimensional data. We implemented three MapReduce algorithms for processing aggregations and top-k queries over the BSI index: a baseline algorithm using a tree reduction of the slices, a group-slice algorithm, and an optimized two-phase algorithm that uses bit-slice mapping. The implementations run on top of Apache Spark using vertical and horizontal data partitioning. The bit-slice mapping approach is shown to outperform the baseline MapReduce implementations by virtue of using a smaller index and offering better control over task granularity and load balancing.
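The abstract does not spell out the three algorithms, so the sketch below is only one plausible reading, written in plain Python rather than Spark. It contrasts a pairwise tree reduction over whole attribute BSIs with a two-phase aggregation that keys each slice by its bit position. The assumption that "bit-slice mapping" means keying slices by bit position, and all function names, are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from functools import reduce
from typing import Dict, List

BSI = List[int]  # slices[j] has bit r set iff bit j of row r's value is 1


def build_bsi(values: List[int]) -> BSI:
    """Encode a column of non-negative integers as bit slices."""
    n_bits = max(values, default=0).bit_length()
    return [sum(((v >> j) & 1) << r for r, v in enumerate(values))
            for j in range(n_bits)]


def add_bsi(a: BSI, b: BSI) -> BSI:
    """Row-wise addition of two BSIs using a ripple-carry adder over slices."""
    out, carry = [], 0
    for j in range(max(len(a), len(b))):
        x = a[j] if j < len(a) else 0
        y = b[j] if j < len(b) else 0
        out.append(x ^ y ^ carry)
        carry = (x & y) | (carry & (x ^ y))
    if carry:
        out.append(carry)
    return out


def decode_bsi(slices: BSI, n_rows: int) -> List[int]:
    """Recover the per-row integers (used here only to check results)."""
    return [sum(((s >> r) & 1) << j for j, s in enumerate(slices))
            for r in range(n_rows)]


def tree_reduce_sum(attrs: List[BSI]) -> BSI:
    """Baseline: add whole attribute BSIs pairwise, level by level."""
    level = list(attrs)
    while len(level) > 1:
        nxt = [add_bsi(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd one out moves up unchanged
        level = nxt
    return level[0] if level else []


def slice_mapped_sum(attrs: List[BSI]) -> BSI:
    """Two-phase sum keyed by bit position (an assumed interpretation)."""
    # Phase 1 (map + reduce by key): group slices by bit position j and add
    # them, giving for each j a small BSI that counts, per row, how many
    # attributes have bit j set.
    by_pos: Dict[int, List[BSI]] = defaultdict(list)
    for bsi in attrs:
        for j, s in enumerate(bsi):
            by_pos[j].append([s])
    partial = {j: reduce(add_bsi, group) for j, group in by_pos.items()}
    # Phase 2: weight each count by 2**j (prepend j empty slices) and add.
    shifted = [[0] * j + p for j, p in partial.items()]
    return reduce(add_bsi, shifted, [])


columns = [[3, 7, 2, 5], [4, 1, 6, 5], [1, 1, 1, 1]]
attrs = [build_bsi(c) for c in columns]
print(decode_bsi(tree_reduce_sum(attrs), 4))   # [8, 9, 9, 11]
print(decode_bsi(slice_mapped_sum(attrs), 4))  # [8, 9, 9, 11]
```

In this reading, the number of phase-1 keys equals the number of bit positions rather than the number of attributes, which is one way a scheme of this kind could give finer control over task granularity; whether that matches the paper's design is not confirmed by the abstract.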

Published In

IDEAS '16: Proceedings of the 20th International Database Engineering & Applications Symposium
July 2016
420 pages
ISBN: 9781450341189
DOI: 10.1145/2938503
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

In-Cooperation

  • Keio University

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

IDEAS '16

Acceptance Rates

Overall Acceptance Rate 74 of 210 submissions, 35%

Cited By

  • (2021) Faster Multidimensional Data Queries on Infrastructure Monitoring Systems. Big Data Research. DOI: 10.1016/j.bdr.2021.100288 (100288). Online publication date: Nov-2021
  • (2019) Multidimensional Preference Query Optimization on Infrastructure Monitoring Systems. 2019 IEEE International Conference on Big Data (Big Data). DOI: 10.1109/BigData47090.2019.9005666 (3727-3736). Online publication date: Dec-2019
  • (2019) High-dimensional similarity searches using query driven dynamic quantization and distributed indexing. Distributed and Parallel Databases. DOI: 10.1007/s10619-019-07266-x. Online publication date: 11-Apr-2019
  • (2017) Supporting Dynamic Quantization for High-Dimensional Data Analytics. Proceedings of the ExploreDB'17. DOI: 10.1145/3077331.3077336 (1-6). Online publication date: 14-May-2017
  • (2017) Delay‐bounded skyline computing for large‐scale real‐time online data analytics. Concurrency and Computation: Practice and Experience 29:10. DOI: 10.1002/cpe.4085. Online publication date: 6-Mar-2017
  • (2016) On-demand aggregation of gridded data over user-specified spatio-temporal domains. Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. DOI: 10.1145/2996913.2996944 (1-4). Online publication date: 31-Oct-2016
  • (2016) Power efficient big data analytics algorithms through low-level operations. 2016 IEEE International Conference on Big Data (Big Data). DOI: 10.1109/BigData.2016.7840623 (355-361). Online publication date: Dec-2016
