Abstract
Skyline queries find only the interesting tuples in multi-dimensional datasets and are useful for multi-criteria decision making. To improve the performance of skyline query processing on large-scale data, it is necessary to use parallel and distributed frameworks such as MapReduce, which has been widely used in recent years. Several approaches process skyline queries on a MapReduce framework. Some of them perform part of the skyline computation serially, while others perform all parts of it in parallel. However, each suffers from at least one of two drawbacks: (1) serial computation may prevent the method from fully utilizing the parallelism of the MapReduce framework; (2) when skyline queries are processed in a parallel and distributed manner, the additional overhead of parallel processing may outweigh the benefit gained from parallelization. To process skyline queries efficiently on large data in parallel, we propose a novel two-phase approach in the MapReduce framework. In the first phase, we divide the input dataset into a number of subsets (called cells) and compute local skylines only for the qualified cells. The outer-cell filter used in this phase considerably improves performance by eliminating a large number of tuples in unqualified cells. In the second phase, the global skyline is computed from the local skylines. To determine the global skyline tuples from each local skyline separately and in parallel, we design the inner-cell filter and propose efficient methods that reduce the overhead of computing and using the inner-cell filters. The primary advantage of our approach is that, with the two filtering techniques, it processes skyline queries quickly and in a fully parallelized manner in all stages of the MapReduce framework.
Through extensive experiments, we demonstrate that the proposed approach substantially improves the overall performance of skyline queries compared with state-of-the-art skyline processing methods. In particular, the proposed method achieves remarkably good performance and scalability with respect to both dataset size and dimensionality. Our approach has significant benefits for large-scale skyline query processing in distributed and parallel computing environments.
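The two-phase idea described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the abstract does not specify how the outer-cell and inner-cell filters work, so the sketch substitutes a simple, commonly used grid rule (a cell is skipped when another occupied cell has a strictly smaller index in every dimension) for the first phase and a plain pairwise merge for the second. The function names, the `parts` parameter, the assumption that smaller values are preferred, and the assumption that all coordinates lie in [0, 1] are illustrative choices, and the loops stand in for what would be map and reduce tasks.

```python
from collections import defaultdict


def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere
    (assumption: smaller values are preferred in every dimension)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points):
    """Naive quadratic skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def cell_of(point, parts):
    """Grid cell index of a point, assuming coordinates lie in [0, 1]."""
    return tuple(min(int(x * parts), parts - 1) for x in point)


def two_phase_skyline(points, parts=4):
    # Partition the input into grid cells.
    cells = defaultdict(list)
    for p in points:
        cells[cell_of(p, parts)].append(p)

    # Stand-in for a cell-level filter: cell c is unqualified if some occupied
    # cell o has a strictly smaller index in every dimension, because every
    # tuple in o then dominates every tuple in c.
    occupied = list(cells)
    qualified = [
        c for c in occupied
        if not any(all(oi < ci for oi, ci in zip(o, c)) for o in occupied)
    ]

    # Phase 1: local skylines, computed only for qualified cells.
    local = {c: skyline(cells[c]) for c in qualified}

    # Phase 2: global skyline from the union of local skylines
    # (a plain merge here, not the inner-cell filter of the paper).
    candidates = [p for sky in local.values() for p in sky]
    return skyline(candidates)


points = [(0.1, 0.9), (0.2, 0.6), (0.5, 0.5), (0.6, 0.2),
          (0.9, 0.1), (0.7, 0.7), (0.8, 0.9), (0.3, 0.8)]
print(sorted(two_phase_skyline(points)))
# [(0.1, 0.9), (0.2, 0.6), (0.5, 0.5), (0.6, 0.2), (0.9, 0.1)]
```

In this example the tuples (0.8, 0.9) and (0.3, 0.8) fall into cells that are pruned before any tuple-level comparison, which is the kind of saving the first phase aims for. The merge in the second phase is the serial step that a per-cell filter would need to replace to keep the whole computation parallel.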
Acknowledgements
This work was supported by the Bio-Synergy Research Project (2013M3A9C4078137) of the MSIT (Ministry of Science and ICT), Korea through the NRF, and by the MSIT (Ministry of Science and ICT), Korea under the ITRC support program (IITP-2017-2013-0-00881) supervised by the IITP.
Cite this article
Kim, J., Kim, M.H. An efficient parallel processing method for skyline queries in MapReduce. J Supercomput 74, 886–935 (2018). https://doi.org/10.1007/s11227-017-2171-y
DOI: https://doi.org/10.1007/s11227-017-2171-y