research-article

Optimization of Data Distribution Strategy in Theta-join Process based on Spark

Authors:
Shijiu Cao, E. Haihong, Meina Song, Ken Zhang

School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China

ICACS '18: Proceedings of the 2nd International Conference on Algorithms, Computing and Systems, July 2018, Pages 71–75
https://doi.org/10.1145/3242840.3242861

Published: 27 July 2018

ABSTRACT

Theta-join between tables is a common operation in data querying and statistical analysis, and it becomes costly on large datasets. In a distributed environment, a theta-join inevitably incurs heavy computation and communication overhead. In addition, the diversity of real data gives rise to data skew. To address uneven data distribution and data skew in theta-join processing, we propose a solution that improves the data filtering strategy and introduces a data distribution method based on quantified factors that affect join efficiency. The solution is implemented on the distributed computing framework Spark. Experimental results show that our method applies to many types of data and achieves better performance.
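
To make the problem concrete, the following is a minimal sketch in plain Python (no Spark) of a theta-join and of a naive grid-style distribution of the join work. It is not the method proposed in the paper; the helper names `theta_join`, `chunk`, and `grid_theta_join`, the 2 x 2 grid, and the sample data are all assumptions chosen for illustration.

```python
from itertools import product
from operator import lt


def theta_join(r, s, theta):
    """Return all pairs (a, b) with a in r, b in s, and theta(a, b) true."""
    return [(a, b) for a, b in product(r, s) if theta(a, b)]


def chunk(values, parts):
    """Split a list into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(values), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def grid_theta_join(r, s, theta, rows, cols):
    """Join each (R-chunk, S-chunk) cell on its own, as separate workers would."""
    cells = {}
    for i, r_part in enumerate(chunk(r, rows)):
        for j, s_part in enumerate(chunk(s, cols)):
            cells[(i, j)] = theta_join(r_part, s_part, theta)
    return cells


# Hypothetical sample data: join R and S on the condition R.a < S.b.
r = [1, 2, 3, 4]
s = [1, 2, 3, 4]
cells = grid_theta_join(r, s, lt, rows=2, cols=2)

# Number of result pairs produced by each grid cell (one cell per worker).
output_sizes = {cell: len(pairs) for cell, pairs in cells.items()}
```

In this toy case every cell receives the same amount of input (two values from each table), yet the cells produce 1, 4, 0, and 1 result pairs. Balancing input size alone therefore does not balance the work, which is one simple way to see why the choice of data distribution matters for theta-joins.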

Index Terms

1. Computing methodologies
  1. Distributed computing methodologies
    1. Distributed algorithms

Published in

ICACS '18: Proceedings of the 2nd International Conference on Algorithms, Computing and Systems
July 2018
245 pages
ISBN: 9781450365093
DOI: 10.1145/3242840

Copyright © 2018 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Big data
Spark
data skew
distributed computing
theta-join
Qualifiers
- research-article
- Research
- Refereed limited