
A speculative parallel decompression algorithm on Apache Spark

The Journal of Supercomputing

Abstract

Data decompression is one of the most important techniques in data processing and is widely used in multimedia information transmission and processing. However, existing decompression algorithms on multicore platforms are time-consuming and do not handle large datasets well. To expand parallelism and improve decompression efficiency on large-scale datasets, this paper proposes a speculative parallel decompression algorithm on Apache Spark based on the software thread-level speculation technique. By analyzing the data structure of the compressed data, the algorithm first employs a function to divide the compressed data into blocks that can be decompressed independently, then spawns a number of threads to decompress the blocks speculatively in parallel, and finally merges the speculative results to form the final output. Compared with the conventional parallel approach on multicore platforms, the proposed algorithm is highly efficient and achieves a high degree of parallelism by making full use of the cluster's resources. Experiments show that the proposed approach achieves a 2.6\(\times \) speedup on average over the traditional approach. In addition, as the number of worker nodes grows, the execution time decreases gradually and the speedup scales linearly. These results indicate that decompression efficiency can be significantly enhanced by adopting this speculative parallel algorithm.
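The divide/decompress/merge pipeline described above can be summarized as a minimal Spark sketch. The sketch below is illustrative only and is not the authors' implementation: for simplicity, block boundaries are known in advance because each block is stored as an independent DEFLATE stream, whereas the paper's algorithm speculatively guesses block boundaries inside a single variable-length compressed stream and validates those guesses. All object and method names are hypothetical.

```scala
import java.util.zip.{Deflater, Inflater}
import org.apache.spark.{SparkConf, SparkContext}

object ParallelBlockDecompressSketch {

  // Compress one block as an independent DEFLATE stream (a stand-in for the
  // independently decompressible blocks produced by the division step).
  def deflateBlock(block: Array[Byte]): Array[Byte] = {
    val d = new Deflater()
    d.setInput(block)
    d.finish()
    val out = new java.io.ByteArrayOutputStream()
    val buf = new Array[Byte](64 * 1024)
    while (!d.finished()) {
      val n = d.deflate(buf)
      out.write(buf, 0, n)
    }
    d.end()
    out.toByteArray
  }

  // Decompress one block independently of all others.
  def inflateBlock(block: Array[Byte], originalLen: Int): Array[Byte] = {
    val inf = new Inflater()
    inf.setInput(block)
    val out = new Array[Byte](originalLen)
    var off = 0
    while (!inf.finished() && off < originalLen) {
      off += inf.inflate(out, off, originalLen - off)
    }
    inf.end()
    out
  }

  def main(args: Array[String]): Unit = {
    // Local master for illustration; a real deployment would submit to a cluster.
    val sc = new SparkContext(
      new SparkConf().setAppName("block-decompress-sketch").setMaster("local[*]"))

    // Simulated input: split some data into fixed-size blocks and compress
    // each block independently, keeping the block index as the key.
    val blockSize = 1 << 20
    val original  = Array.tabulate[Byte](8 * blockSize)(i => (i % 251).toByte)
    val blocks = original.grouped(blockSize).zipWithIndex.map {
      case (chunk, idx) => (idx, (deflateBlock(chunk), chunk.length))
    }.toSeq

    // Distribute the compressed blocks and decompress them in parallel
    // across the cluster.
    val decompressed = sc.parallelize(blocks).mapValues {
      case (compressed, len) => inflateBlock(compressed, len)
    }

    // Merge the per-block results back into their original order.
    val merged = decompressed.sortByKey().values.collect().flatten

    assert(merged.sameElements(original), "round trip failed")
    sc.stop()
  }
}
```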





Acknowledgements

We thank our 3C laboratory colleagues for their support, collaboration, and feedback during this work. We also thank the anonymous reviewers for their insightful comments and suggestions. This work is supported by the National Natural Science Foundation of China under Grant No. 61640219 and the Doctoral Fund of the Ministry of Education of China under Grant No. 2013021110012.

Author information


Corresponding author

Correspondence to Yinliang Zhao.


About this article


Cite this article

Wang, Z., Zhao, Y., Liu, Y. et al. A speculative parallel decompression algorithm on Apache Spark. J Supercomput 73, 4082–4111 (2017). https://doi.org/10.1007/s11227-017-2000-3

