Abstract
In the last decade, skyline query processing has become increasingly important because of its usefulness in decision-making applications. Because the datasets used for skyline query processing are huge, algorithms for MapReduce-based skyline query processing have been widely studied. However, existing algorithms suffer from low filtering efficiency in local skyline computation, and they unrealistically assume both uniform data distributions and dimensional independence. In this paper, we propose a parallel skyline query processing algorithm for MapReduce that uses multiple regression analysis. The goal of our algorithm is to efficiently find the set of skyline points in a large dataset by reducing the number of candidates before the skyline computation. To support skyline computation on anti-correlated datasets, we compute a data filtering threshold line based on a multiple regression analysis of a sample of the dataset. To guarantee the accuracy of the skyline result, we consider both the filtering threshold line and a grid-based cell dominance condition, so that only relevant data are passed to the actual skyline computation step. For the local skyline computation, we use an angle-based partitioning of the data space that effectively eliminates non-promising points within partitions. For the global skyline computation, we use the dominance relationship among grid-based partitions to prune unnecessary skyline candidates. Performance analyses show that our parallel skyline query processing algorithm outperforms existing algorithms under various settings.
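The local/global structure described above can be illustrated with a minimal single-process sketch. It is not the authors' implementation: it assumes 2-D points with non-negative coordinates where smaller values are preferred, uses a naive quadratic skyline routine, and leaves out the regression-based threshold line and the grid-based cell dominance check, whose details the abstract does not give. The function names (`dominates`, `skyline`, `angular_partition`, `two_phase_skyline`) are hypothetical.

```python
import math


def dominates(p, q):
    """Return True if p dominates q: p is no worse in every dimension
    and strictly better in at least one (smaller is better, an assumption)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Naive O(n^2) skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def angular_partition(points, num_partitions):
    """Split 2-D non-negative points into equal angular sectors of [0, pi/2]."""
    parts = [[] for _ in range(num_partitions)]
    width = (math.pi / 2) / num_partitions
    for x, y in points:
        angle = math.atan2(y, x)
        # Clamp so that points on the y-axis fall into the last sector.
        idx = min(int(angle / width), num_partitions - 1)
        parts[idx].append((x, y))
    return parts


def two_phase_skyline(points, num_partitions=3):
    """Local skylines per sector (map-like phase), then a global skyline
    over the merged local results (reduce-like phase)."""
    local = [skyline(part) for part in angular_partition(points, num_partitions)]
    candidates = [p for part in local for p in part]
    return skyline(candidates)


points = [(1, 9), (2, 7), (4, 4), (7, 2), (9, 1), (5, 5), (8, 8), (3, 8)]
result = sorted(two_phase_skyline(points))
```

Because any point dominated inside its own partition is also dominated in the full dataset, taking the skyline of the merged local skylines yields the same result as a single pass over all points; the partitioning only changes how much work each phase performs.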
Acknowledgements
This work was partly supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. R0113-16-0005, Development of a Unified Data Engineering Technology for Large-scale Transaction Processing and Real-time Complex Analytics). This research was also supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (grant number 2016R1D1A3B03935298).
Ethics declarations
Conflict of interest
The author(s) declare(s) that there is no conflict of interest regarding the publication of this paper.
About this article
Cite this article
Jang, M., Song, Y. & Chang, JW. A parallel computation of skyline using multiple regression analysis-based filtering on MapReduce. Distrib Parallel Databases 35, 383–409 (2017). https://doi.org/10.1007/s10619-017-7202-4
DOI: https://doi.org/10.1007/s10619-017-7202-4