PFIMD: a parallel MapReduce-based algorithm for frequent itemset mining

Yimin, Mao; Junhao, Geng; Mwakapesa, Deborah Simon; Nanehkaran, Yaser Ahangari; Chi, Zhang; Xiaoheng, Deng; Zhigang, Chen

doi:10.1007/s00530-020-00725-x

Special Issue Paper
Published: 13 March 2021

Volume 27, pages 709–722, (2021)
Multimedia Systems

Mao Yimin¹,
Geng Junhao¹,
Deborah Simon Mwakapesa¹,
Yaser Ahangari Nanehkaran¹,
Zhang Chi¹,
Deng Xiaoheng² &
Chen Zhigang²

Abstract

Frequent itemset mining (FIM) is a widely used data mining technique for discovering sets of items that frequently occur together in a dataset. As datasets have grown, FIM in big data environments has attracted considerable research attention and prompted many new algorithms. This study designs an optimized parallel frequent itemset mining algorithm based on MapReduce, named PFIMD, to address two weaknesses of existing parallel FIM algorithms: the high time and space complexity of processing and computing itemsets, and the failure to balance load adequately among parallel tasks. First, a structure called DiffNodeset is adopted to avoid the growth of N-list cardinality in the MRPrePost algorithm. Then, a 2-way comparison strategy is designed to speed up DiffNodeset generation for 2-itemsets and reduce the time complexity of the algorithm. Next, the steps of the improved algorithm are parallelized using the Hadoop cloud computing platform and the MapReduce programming model. Moreover, to group the items in the F-list uniformly, a load balancing strategy based on dynamic grouping is proposed, which addresses the uneven load across nodes in the cluster. The experimental results show that the modified algorithm not only overcomes the shortcomings of MRPrePost in the big data environment, but also greatly reduces time and space complexity. Finally, applications of PFIMD to several multimedia datasets are presented to illustrate its generality.
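The abstract does not spell out how the F-list is built or how the dynamic grouping works, so the following is only a minimal sketch of the general idea and not the authors' method. It assumes the F-list is the list of frequent items sorted by descending support count, and it swaps in a generic greedy heuristic (assign each item to the currently lightest group) in place of the paper's strategy. The function names `build_f_list` and `group_items`, the use of support count as the load estimate, and the toy transactions are all hypothetical.

```python
from collections import Counter
import heapq


def build_f_list(transactions, min_support):
    """Return (item, support) pairs for frequent items, highest support first.

    Ties are broken by item name so the result is deterministic.
    """
    # Count each item at most once per transaction.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = [(item, c) for item, c in counts.items() if c >= min_support]
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


def group_items(f_list, num_groups):
    """Greedily assign each item to the group with the smallest load so far.

    ASSUMPTION: an item's load is approximated by its support count.
    This is a generic balancing heuristic, not the strategy from the paper.
    """
    heap = [(0, g) for g in range(num_groups)]  # (accumulated load, group id)
    heapq.heapify(heap)
    groups = [[] for _ in range(num_groups)]
    for item, load in f_list:
        total, g = heapq.heappop(heap)
        groups[g].append(item)
        heapq.heappush(heap, (total + load, g))
    return groups


# Toy data, invented for illustration only.
transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c", "d"],
    ["b", "c"],
    ["a", "e"],
]
f_list = build_f_list(transactions, min_support=2)
groups = group_items(f_list, num_groups=2)
print(f_list)
print(groups)
```

In a MapReduce setting, the counting step would correspond to a word-count style job, and each group would be handed to a separate task; the sketch runs both steps in a single process to keep the grouping logic visible.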

Acknowledgements

This study was supported by the National Natural Science Foundation of China (41562019) and the National Key Research and Development Program of China (2018YFC1504705).

Author information

Authors and Affiliations

School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, 341000, Jiangxi, China
Mao Yimin, Geng Junhao, Deborah Simon Mwakapesa, Yaser Ahangari Nanehkaran & Zhang Chi
School of Computer Science and Engineering, Central South University, Changsha, 410083, Hunan, China
Deng Xiaoheng & Chen Zhigang

Corresponding author

Correspondence to Deng Xiaoheng.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Yimin, M., Junhao, G., Mwakapesa, D.S. et al. PFIMD: a parallel MapReduce-based algorithm for frequent itemset mining. Multimedia Systems 27, 709–722 (2021). https://doi.org/10.1007/s00530-020-00725-x

Received: 02 July 2020
Accepted: 27 November 2020
Published: 13 March 2021
Issue Date: August 2021
DOI: https://doi.org/10.1007/s00530-020-00725-x