ABSTRACT
Preference (top-k) queries play a key role in modern data analytics tasks. Top-k techniques rely on ranking functions to determine an overall score for each object across all the relevant attributes being examined. This ranking function is provided by the user at query time, or generated for a particular user by a personalized search engine, which prevents pre-computation of the global scores. Executing this type of query is particularly challenging for high-dimensional data. Recently, bit-sliced indices (BSI) were proposed to answer these high-dimensional preference queries efficiently in a centralized environment.
As MapReduce and key-value stores proliferate as the preferred methods for analyzing big data, we set out to evaluate the performance of BSI in a distributed environment, in terms of index size, network traffic, and execution time of preference (top-k) queries over high-dimensional data. We implemented three MapReduce algorithms for processing aggregations and top-k queries over the BSI index: a baseline algorithm using a tree reduction of the slices, a group-slice algorithm, and an optimized two-phase algorithm that uses bit-slice mapping. The implementations are built on top of Apache Spark using vertical and horizontal data partitioning. The bit-slice mapping approach is shown to outperform the baseline MapReduce implementations by virtue of using a reduced-size index and offering better control over task granularity and load balancing.
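To make the idea of aggregating over a bit-sliced index concrete, the following is a minimal single-machine sketch, not the paper's distributed implementation. It assumes non-negative integer attributes, an unweighted sum as the ranking function, and plain Python integers standing in for bit vectors. The helper names (`build_bsi`, `add_bsi`, `decode`, `top_k`) are hypothetical, and the final decode step is a simplification used only to keep the example short.

```python
from functools import reduce
import heapq


def build_bsi(values):
    """Encode a column of non-negative ints as bit slices.

    Slice j is an int whose bit r is set when bit j of values[r] is set.
    (Illustrative layout; a real index would use compressed bit vectors.)
    """
    n_slices = max(max(values, default=0).bit_length(), 1)
    slices = []
    for j in range(n_slices):
        s = 0
        for r, v in enumerate(values):
            if (v >> j) & 1:
                s |= 1 << r
        slices.append(s)
    return slices


def add_bsi(a, b):
    """Add two BSIs row-wise using a ripple-carry adder over the slices."""
    out, carry = [], 0
    for j in range(max(len(a), len(b))):
        x = a[j] if j < len(a) else 0
        y = b[j] if j < len(b) else 0
        out.append(x ^ y ^ carry)                 # sum bit for every row at once
        carry = (x & y) | (carry & (x ^ y))       # carry bit for every row at once
    if carry:
        out.append(carry)
    return out


def decode(bsi, n_rows):
    """Recover per-row integer values from the slices (for illustration only)."""
    return [sum(((s >> r) & 1) << j for j, s in enumerate(bsi))
            for r in range(n_rows)]


def top_k(columns, k):
    """Return the row ids of the k largest attribute sums."""
    n_rows = len(columns[0])
    # Sequential pairwise reduction of the per-attribute BSIs.
    total = reduce(add_bsi, (build_bsi(c) for c in columns))
    scores = decode(total, n_rows)
    return heapq.nlargest(k, range(n_rows), key=lambda r: scores[r])


if __name__ == "__main__":
    attrs = [[3, 1, 4], [2, 7, 0]]   # two attributes, three rows
    print(top_k(attrs, 2))
```

Each call to `add_bsi` touches every row with a handful of bitwise operations per slice, which is what makes slice-wise aggregation attractive when the number of attributes is large. In a distributed setting the pairwise additions would be spread across workers rather than folded sequentially as done here.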
- A Two-Phase MapReduce Algorithm for Scalable Preference Queries over High-Dimensional Data