
A speculative parallel decompression algorithm on Apache Spark

The Journal of Supercomputing

Abstract

Data decompression is one of the most important techniques in data processing and is widely used in multimedia information transmission and processing. However, existing decompression algorithms on multicore platforms are time-consuming and do not handle large datasets well. To expand parallelism and improve decompression efficiency on large-scale datasets, this paper proposes a speculative parallel decompression algorithm on Apache Spark based on the software thread-level speculation technique. By analyzing the data structure of the compressed data, the algorithm first employs a function to divide the compressed data into blocks that can be decompressed independently, then spawns a number of threads to decompress the blocks speculatively in parallel, and finally merges the speculative results to form the final output. Compared with the conventional parallel approach on multicore platforms, the proposed algorithm is highly efficient and achieves a high degree of parallelism by making full use of the cluster's resources. Experiments show that the proposed approach achieves a 2.6\(\times \) speedup on average over the traditional approach. In addition, as the number of worker nodes grows, the execution time decreases gradually and the speedup scales linearly. These results indicate that decompression efficiency can be significantly enhanced by adopting this speculative parallel algorithm.
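The divide/decompress/merge pipeline described above can be summarized as a minimal Spark sketch. The sketch below is illustrative only and is not the authors' implementation: for simplicity, block boundaries are known in advance because each block is stored as an independent DEFLATE stream, whereas the paper's algorithm speculatively guesses block boundaries inside a single variable-length compressed stream and validates those guesses. All object and method names are hypothetical.

```scala
import java.util.zip.{Deflater, Inflater}
import org.apache.spark.{SparkConf, SparkContext}

object ParallelBlockDecompressSketch {

  // Compress one block as an independent DEFLATE stream (a stand-in for the
  // independently decompressible blocks produced by the division step).
  def deflateBlock(block: Array[Byte]): Array[Byte] = {
    val d = new Deflater()
    d.setInput(block)
    d.finish()
    val out = new java.io.ByteArrayOutputStream()
    val buf = new Array[Byte](64 * 1024)
    while (!d.finished()) {
      val n = d.deflate(buf)
      out.write(buf, 0, n)
    }
    d.end()
    out.toByteArray
  }

  // Decompress one block independently of all others.
  def inflateBlock(block: Array[Byte], originalLen: Int): Array[Byte] = {
    val inf = new Inflater()
    inf.setInput(block)
    val out = new Array[Byte](originalLen)
    var off = 0
    while (!inf.finished() && off < originalLen) {
      off += inf.inflate(out, off, originalLen - off)
    }
    inf.end()
    out
  }

  def main(args: Array[String]): Unit = {
    // Local master for illustration; a real deployment would submit to a cluster.
    val sc = new SparkContext(
      new SparkConf().setAppName("block-decompress-sketch").setMaster("local[*]"))

    // Simulated input: split some data into fixed-size blocks and compress
    // each block independently, keeping the block index as the key.
    val blockSize = 1 << 20
    val original  = Array.tabulate[Byte](8 * blockSize)(i => (i % 251).toByte)
    val blocks = original.grouped(blockSize).zipWithIndex.map {
      case (chunk, idx) => (idx, (deflateBlock(chunk), chunk.length))
    }.toSeq

    // Distribute the compressed blocks and decompress them in parallel
    // across the cluster.
    val decompressed = sc.parallelize(blocks).mapValues {
      case (compressed, len) => inflateBlock(compressed, len)
    }

    // Merge the per-block results back into their original order.
    val merged = decompressed.sortByKey().values.collect().flatten

    assert(merged.sameElements(original), "round trip failed")
    sc.stop()
  }
}
```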





Acknowledgements

We thank our 3C laboratory colleagues for their support, collaboration, and feedback during this work. We also thank the anonymous reviewers for their insightful comments and suggestions. This work is supported by the National Natural Science Foundation of China under Grant No. 61640219 and the Doctoral Fund of the Ministry of Education of China under Grant No. 2013021110012.

Author information


Corresponding author

Correspondence to Yinliang Zhao.


About this article


Cite this article

Wang, Z., Zhao, Y., Liu, Y. et al. A speculative parallel decompression algorithm on Apache Spark. J Supercomput 73, 4082–4111 (2017). https://doi.org/10.1007/s11227-017-2000-3

