A Two-Phase MapReduce Algorithm for Scalable Preference Queries over High-Dimensional Data

Authors: Gheorghi Guzun, Guadalupe Canahuate, David Chiu

IDEAS '16: Proceedings of the 20th International Database Engineering & Applications Symposium

Pages 43 - 52

https://doi.org/10.1145/2938503.2938525

Published: 11 July 2016

Abstract

Preference (top-k) queries play a key role in modern data analytics tasks. Top-k techniques rely on a ranking function to determine an overall score for each object across all the relevant attributes being examined. This ranking function is provided by the user at query time, or generated for a particular user by a personalized search engine, which prevents the pre-computation of the global scores. Executing this type of query is particularly challenging for high-dimensional data. Recently, bit-sliced indices (BSI) were proposed to answer these high-dimensional preference queries efficiently in a centralized environment.
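To make the idea of a bit-sliced index concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes non-negative integer attribute values and integer weights, and it stores each bit-slice as a Python integer whose bit i belongs to row i. The function names (build_bsi, add_bsi, scale_bsi, top_k, decode_bsi) and the sample columns are hypothetical. The top-k scan walks the slices from most to least significant, in the spirit of the bit-sliced index arithmetic cited in the references.

```python
def build_bsi(values):
    """Return bit-slices; slice j has bit i set when bit j of values[i] is 1."""
    width = max(values).bit_length() if values else 0
    slices = [0] * width
    for row, v in enumerate(values):
        for j in range(width):
            if (v >> j) & 1:
                slices[j] |= 1 << row
    return slices

def add_bsi(a, b):
    """Add two BSIs row-wise using a ripple-carry adder over the slices."""
    out, carry = [], 0
    for j in range(max(len(a), len(b))):
        x = a[j] if j < len(a) else 0
        y = b[j] if j < len(b) else 0
        out.append(x ^ y ^ carry)
        carry = (x & y) | (carry & (x ^ y))
    if carry:
        out.append(carry)
    return out

def scale_bsi(a, weight):
    """Multiply every row by a non-negative integer weight via shift-and-add."""
    total, shift = [], 0
    while weight:
        if weight & 1:
            total = add_bsi(total, [0] * shift + a)
        weight >>= 1
        shift += 1
    return total

def top_k(bsi, k, n_rows):
    """Return row ids of the k largest values; ties at the cutoff are all kept."""
    sure, cand = 0, (1 << n_rows) - 1   # rows surely in / rows still undecided
    for j in reversed(range(len(bsi))):
        x = sure | (cand & bsi[j])
        n = bin(x).count("1")
        if n > k:
            cand &= bsi[j]               # too many: keep only rows with bit j set
        elif n < k:
            sure = x                     # all of these qualify; look further down
            cand &= ~bsi[j]
        else:
            cand &= bsi[j]               # exactly k rows found
            break
    return [i for i in range(n_rows) if ((sure | cand) >> i) & 1]

def decode_bsi(bsi, n_rows):
    """Rebuild the per-row integers (used here only to check the arithmetic)."""
    return [sum(((s >> i) & 1) << j for j, s in enumerate(bsi)) for i in range(n_rows)]

# Hypothetical data: two attributes over four rows, weights chosen at query time.
attr_a = [3, 7, 1, 5]
attr_b = [2, 2, 6, 1]
score_bsi = add_bsi(scale_bsi(build_bsi(attr_a), 1), scale_bsi(build_bsi(attr_b), 2))
print(decode_bsi(score_bsi, 4))   # per-row scores 1*a + 2*b
print(top_k(score_bsi, 2, 4))     # row ids of the two highest scores
```

Because the weights enter only at query time, nothing about the final scores is precomputed here; only the per-attribute slices are built ahead of the query.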

As MapReduce and key-value stores proliferate as the preferred methods for analyzing big data, we set out to evaluate the performance of BSI in a distributed environment, in terms of index size, network traffic, and execution time of preference (top-k) queries over high-dimensional data. We implemented three MapReduce algorithms for processing aggregations and top-k queries over the BSI index: a baseline algorithm using a tree reduction of the slices, a group-slice algorithm, and an optimized two-phase algorithm that uses bit-slice mapping. The implementations are built on top of Apache Spark using vertical and horizontal data partitioning. The bit-slice mapping approach is shown to outperform the baseline MapReduce implementations by using a smaller index and by offering better control over task granularity and load balancing.
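As a rough illustration of the tree-reduction pattern mentioned above, the sketch below simulates it in plain Python rather than on Spark, and it works on ordinary score vectors instead of bit-slices. It assumes vertical partitioning, where each partition holds a few attribute columns. The map step computes a weighted partial score per row, and the partial results are then combined pairwise, level by level. All names (map_partition, tree_reduce, add_vectors) and the sample data are hypothetical; this is not the paper's baseline, group-slice, or two-phase algorithm.

```python
import heapq

def map_partition(columns, weights):
    """Map step: weighted partial score per row for one partition's attributes."""
    n_rows = len(next(iter(columns.values())))
    return [sum(weights[a] * col[i] for a, col in columns.items())
            for i in range(n_rows)]

def add_vectors(a, b):
    """Combine two partial score vectors element-wise."""
    return [x + y for x, y in zip(a, b)]

def tree_reduce(partials, combine):
    """Combine partial results pairwise, level by level, until one remains."""
    level = list(partials)
    while len(level) > 1:
        nxt = [combine(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])        # odd one out moves up unchanged
        level = nxt
    return level[0]

# Hypothetical vertically partitioned data: three partitions, four rows.
partitions = [
    {"a0": [3, 7, 1, 5], "a1": [2, 2, 6, 1]},
    {"a2": [4, 0, 2, 9]},
    {"a3": [1, 1, 1, 8]},
]
weights = {"a0": 1, "a1": 2, "a2": 1, "a3": 3}   # supplied at query time

partials = [map_partition(p, weights) for p in partitions]
scores = tree_reduce(partials, add_vectors)
top2 = heapq.nlargest(2, range(len(scores)), key=scores.__getitem__)
print(scores, top2)
```

With p partial results the tree has about log2(p) levels, so no single combine step has to touch every partial result at once.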

A Two-Phase MapReduce Algorithm for Scalable Preference Queries over High-Dimensional Data
1. Information systems
  1. Data management systems
    1. Database management system engines
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory

Published In

IDEAS '16: Proceedings of the 20th International Database Engineering & Applications Symposium

July 2016

420 pages

ISBN:9781450341189

DOI:10.1145/2938503

Editor: Evan Desai (ConfSys)

General Chair: Bipin C. Desai (Concordia University)

Program Chairs: Motomichi Toyama (Keio University), Jorge Bernardino (ISEC)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

In-Cooperation

Keio University

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

IDEAS '16: International Database Engineering & Applications Symposium

July 11 - 13, 2016

Montreal, QC, Canada

Acceptance Rates

Overall Acceptance Rate 74 of 210 submissions, 35%

Bibliometrics

Total Citations: 7
Total Downloads: 105
Downloads (Last 12 months): 2
Downloads (Last 6 weeks): 0

Reflects downloads up to 07 Mar 2025

Cited By

Y. Qin and G. Guzun. Faster Multidimensional Data Queries on Infrastructure Monitoring Systems. Big Data Research, article 100288, Nov. 2021. https://doi.org/10.1016/j.bdr.2021.100288

Y. Qin and G. Guzun. Multidimensional Preference Query Optimization on Infrastructure Monitoring Systems. In 2019 IEEE International Conference on Big Data (Big Data), pages 3727--3736, Dec. 2019. https://doi.org/10.1109/BigData47090.2019.9005666

G. Guzun and G. Canahuate. High-dimensional similarity searches using query driven dynamic quantization and distributed indexing. Distributed and Parallel Databases, Apr. 2019. https://doi.org/10.1007/s10619-019-07266-x

G. Guzun and G. Canahuate. Supporting Dynamic Quantization for High-Dimensional Data Analytics. In Proceedings of the ExploreDB'17, pages 1--6, May 2017. https://dl.acm.org/doi/10.1145/3077331.3077336

Q. Wang, C. Yu, Y. Zhang, H. Li, and P. Zhong. Delay-bounded skyline computing for large-scale real-time online data analytics. Concurrency and Computation: Practice and Experience, 29(10), Mar. 2017. https://doi.org/10.1002/cpe.4085

J. Tosado, G. Guzun, G. Canahuate, R. Mantilla, M. Ali, S. Newsam, M. Renz, G. Trajcevski, and S. Ravada. On-demand aggregation of gridded data over user-specified spatio-temporal domains. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 1--4, Oct. 2016. https://dl.acm.org/doi/10.1145/2996913.2996944

G. Guzun, J. McClurg, G. Canahuate, and R. Mudumbai. Power efficient big data analytics algorithms through low-level operations. In 2016 IEEE International Conference on Big Data (Big Data), pages 355--361, Dec. 2016. https://doi.org/10.1109/BigData.2016.7840623
