ABSTRACT
The theta-join between tables is a common operation in data querying and statistical analysis. On large data sets it is expensive: in a distributed environment it inevitably incurs heavy computation and communication overhead. Because real data are diverse, it also suffers from data skew. To address uneven data distribution and data skew in theta-join processing, we propose a solution that improves the data filtering strategy and introduces a data distribution method based on factors affecting join efficiency, which we quantify. The solution is implemented on the distributed computing framework Spark. Experimental results show that the method applies to many types of data and achieves better performance.
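As a rough illustration of the two problems named above, the following minimal sketch runs a theta-join with the condition `l < r` as a plain nested loop and then counts how many result pairs each partition would produce under a simple range partitioning. It is not the method proposed in the paper and does not use Spark; the function names, the sample data, and the single partition boundary are all assumptions made for illustration.

```python
from collections import Counter
from operator import lt


def theta_join(left, right, theta):
    """Nested-loop theta-join: keep every pair (l, r) where theta(l, r) holds."""
    return [(l, r) for l in left for r in right if theta(l, r)]


def range_partition_load(left, right, theta, boundaries):
    """Count result pairs per partition when left values are range-partitioned.

    A left value v goes to partition k, where k is the number of
    boundaries b with v >= b (hypothetical partitioning scheme).
    """
    def part(v):
        return sum(v >= b for b in boundaries)

    load = Counter()
    for l, _ in theta_join(left, right, theta):
        load[part(l)] += 1
    return load


# Illustrative data only.
left = [1, 2, 3, 8]
right = [2, 4, 9]

pairs = theta_join(left, right, lt)  # join condition: l < r
load = range_partition_load(left, right, lt, boundaries=[5])
```

With this sample data, the partition holding the small left values produces most of the result pairs, even though the input rows are split three to one. Uneven output of this kind is what a skew-aware distribution strategy tries to balance.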
Index Terms
- Optimization of Data Distribution Strategy in Theta-join Process based on Spark