DOI: 10.1145/3242840.3242861
research-article

Optimization of Data Distribution Strategy in Theta-join Process based on Spark

Published: 27 July 2018

ABSTRACT

The theta-join between tables is a common operation in data querying and statistical analysis. On large data sets it is costly: in a distributed environment, a theta-join inevitably incurs heavy computation and communication overhead. In addition, the diversity of real data gives rise to data skew. To address uneven data distribution in the theta-join and data skew during processing, we propose a solution that improves the data filtering strategy and introduces a data distribution method based on factors affecting join efficiency, which we quantify. The solution is implemented on the distributed computing framework Spark. Experimental results show that the method applies to many types of data and achieves better performance.
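To make the two notions in the abstract concrete, the following is a minimal illustrative sketch in plain Python rather than Spark. It is not the method proposed in the paper: the helper names `theta_join` and `partition_loads`, the toy data, and the modulo partitioner are all assumptions chosen for illustration. It shows a theta-join as a join under an arbitrary comparison predicate, and how a repeated key can leave one partition holding most of the records.

```python
from collections import Counter
from operator import lt


def theta_join(left, right, theta):
    """Return every (l, r) pair for which the predicate theta(l, r) holds.

    Hypothetical helper: a naive nested-loop join, for illustration only.
    """
    return [(l, r) for l in left for r in right if theta(l, r)]


def partition_loads(keys, num_partitions):
    """Count how many records a simple modulo partitioner sends to each partition.

    Hypothetical helper: stands in for a hash partitioner on integer keys.
    """
    loads = Counter(k % num_partitions for k in keys)
    return [loads.get(p, 0) for p in range(num_partitions)]


# Toy inputs (assumed data, not from the paper).
left = [1, 5, 9]
right = [2, 5, 5, 5, 8]

# Theta-join with the condition l < r instead of equality.
pairs = theta_join(left, right, lt)

# The repeated key 5 makes one of the four partitions much heavier than the rest.
loads = partition_loads(right, 4)

print(len(pairs), loads)
```

Because the predicate is not an equality, a record may match many records on the other side, so a naive distributed plan tends to compare or ship far more pairs than an equi-join would; the uneven `loads` list is a small-scale picture of the skew the abstract refers to.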


    • Published in

      ICACS '18: Proceedings of the 2nd International Conference on Algorithms, Computing and Systems
      July 2018
      245 pages
      ISBN:9781450365093
      DOI:10.1145/3242840

      Copyright © 2018 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Qualifiers

      • research-article
      • Research
      • Refereed limited
