On Efficient Mining of Frequent Itemsets from Big Uncertain Databases

Shah, Ahsan; Halim, Zahid

doi:10.1007/s10723-018-9456-0

Published: 06 August 2018

Volume 17, pages 831–850, (2019)
Journal of Grid Computing

Abstract

In the current era of information, communication, and technology, data is being generated at an exponential rate. This gives machine learning and data mining algorithms an opportunity to learn from huge data repositories. At the same time, big data poses many challenges, with data uncertainty being the key concern of modern data mining systems. This work addresses the problem of extracting frequent itemsets from such large uncertain databases to assist decision makers in understanding non-trivial data trends. The usual technique for finding frequent itemsets in uncertain databases is known as Possible World Semantics (PWS). However, as the database size increases, PWS suffers from performance issues, so there is a need for efficient frequent pattern mining algorithms. This work presents three techniques to address this issue: a 3D linked array-based strategy, a connected tree technique, and an average probability-based setup supported by a tree data structure. The objective is to minimize computational cost by traversing the database only once. The 3D linked array-based solution scans the database once and stores the support information of each item and its association with other items within the 3D array. In the tree-based method, a 1D array is associated with each node of the tree, comprising the support information of the database items and their associations with other items. The average probability-based approach computes the average probability factor and uses it to map the uncertain database to a tree. The proposal addresses attribute uncertainty as well as tuple uncertainty to map large uncertain databases to the proposed data structures. In addition to introducing the three data structures, this work also presents algorithms to extract frequent itemsets.
The proposal is compared with four recent works in this domain for uncertain data, namely the mining threshold-based (MB) technique, frequent itemsets using nodesets (FIN), PrePost+, and uncertain Apriori (UApriori). Experiments are performed on four benchmark datasets. The results suggest better performance for the three techniques presented here, with 60% less execution time.
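To make the cost of PWS concrete, the following is a minimal sketch and not the authors' data structures or algorithms. It assumes attribute-level uncertainty with independent item existence probabilities, and it uses a hypothetical toy database `DB`. Under those assumptions, the expected support of an itemset can be accumulated in a single scan as a sum of per-transaction probability products, and the brute-force function checks that this agrees with enumerating every possible world, whose number grows exponentially with the number of uncertain items.

```python
from itertools import product
from math import prod, isclose

# Hypothetical toy database: each transaction maps an item to its
# existence probability (attribute-level uncertainty, assumed independent).
DB = [
    {"a": 0.9, "b": 0.5},
    {"a": 0.4, "c": 0.8},
    {"a": 1.0, "b": 0.2, "c": 0.6},
]

def expected_support(db, itemset):
    """Single scan: sum over transactions of the product of item probabilities."""
    total = 0.0
    for t in db:
        if all(item in t for item in itemset):
            total += prod(t[item] for item in itemset)
    return total

def expected_support_pws(db, itemset):
    """Brute force: enumerate every possible world, weighting its support count."""
    # One slot per (transaction, item) pair; each slot is present or absent.
    slots = [(ti, item, p) for ti, t in enumerate(db) for item, p in t.items()]
    target = set(itemset)
    total = 0.0
    for present in product([False, True], repeat=len(slots)):
        weight = 1.0
        world = [set() for _ in db]
        for (ti, item, p), here in zip(slots, present):
            weight *= p if here else 1.0 - p
            if here:
                world[ti].add(item)
        support = sum(1 for t in world if target <= t)
        total += weight * support
    return total

if __name__ == "__main__":
    print(expected_support(DB, ("a", "b")))
    print(expected_support_pws(DB, ("a", "b")))
```

The toy database has 7 uncertain entries, so the brute-force version already visits 2^7 = 128 worlds, while the single-scan version touches each transaction once.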

References

Aggarwal, C.C., Philip, S.Y.: A survey of uncertain data algorithms and applications. IEEE Trans. Knowl. Data Eng. 21(5), 609–623 (2009)
Alencar, N., Brayner, A., Filho, J.A., Lopes, H.: Dac scan: a novel scan operator for exploiting SSD internal parallelism. Concurr. Comput. Pract. Exper. 29(8), e4031 (2017)
Cheng, R., Kalashnikov, D.V., Prabhakar, S.: Evaluating probabilistic queries over imprecise data. In: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pp. 551–562 (2003)
Cormode, G., Garofalakis, M.: Sketching probabilistic data streams. In: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pp. 281–292 (2007)
Chui, C.K., Kao, B., Hung, E.: Mining frequent itemsets from uncertain data. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 47–58 (2007)
Dalvi, N., Suciu, D.: Efficient query evaluation on probabilistic databases. VLDB J. Int. J. Very Large Data Bases 16(4), 523–544 (2007)
Deshpande, A., Guestrin, C., Madden, S.R., Hellerstein, J.M., Hong, W.: Model-driven data acquisition in sensor networks. In: Proceedings of the Thirtieth international conference on Very large data bases-Volume, vol. 30, pp. 588–599 (2004)
Deng, Z.H., Lv, S.L.: Fast mining frequent itemsets using Nodesets. Expert Syst. Appl. 41(10), 4505–4512 (2014)
Deng, Z.H., Lv, S.L.: PrePost+: An efficient N-lists-based algorithm for mining frequent itemsets via Children–Parent Equivalence pruning. Expert Syst. Appl. 42(13), 5424–5432 (2015)
Djenouri, Y., Belhadi, A., Fournier-Viger, P.: Extracting useful knowledge from event logs: A frequent itemset mining approach. Knowl.-Based Syst. 139, 132–148 (2018)
Han, J., Pei, J., Yin, Y., Mao, R.: Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Min. Knowl. Discov. 8(1), 53–87 (2004)
Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. ACM Sigmod Record. 29(2), 1–12 (2000)
Hsieh, T.J.: A micro-view-based data mining approach to diagnose the aging status of heating coils. Knowl.-Based Syst. 143, 10–18 (2017)
Huang, J., Antova, L., Koch, C., Olteanu, D.: MayBMS: a probabilistic database management system. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pp. 1071–1074 (2009)
Hu, W., Chen, T., Shah, S.L.: Detection of frequent alarm patterns in industrial alarm floods using itemset mining methods. IEEE Trans. Ind. Electron. 65(9), 7290–7300 (2018)
Jampani, R., Xu, F., Wu, M., Perez, L.L., Jermaine, C., Haas, P.J.: MCDB: A Monte Carlo Approach to managing uncertain data. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 687–700 (2008)
Karim, M.R., Cochez, M., Beyan, O.D., Ahmed, C.F., Decker, S.: Mining maximal frequent patterns in transactional databases and dynamic data streams: a spark-based approach. Inform. Sci. 432, 278–300 (2018)
Lee, G., Yun, U., Ryang, H.: An uncertainty-based approach: frequent itemset mining from uncertain data with different item importance. Knowl.-Based Syst. 90, 239–256 (2015)
Leung, C.K.S., MacKinnon, R.K.: Fast algorithms for frequent itemset mining from uncertain data. In: IEEE International Conference on Data Mining (ICDM), pp. 893–898 (2014)
Leung, C.K.S., Mateo, M.A.F., Brajczuk, D.A.: A tree-based approach for frequent pattern mining from uncertain data. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 653–661 (2008)
Li, H., Zhang, N.: Probabilistic maximal frequent itemset mining over uncertain databases. In: International Conference on Database Systems for Advanced Applications, pp. 149–163 (2016)
Lin, C.W., Hong, T.P.: A new mining approach for uncertain databases using CUFP trees. Expert Syst. Appl. 39(4), 4084–4093 (2012)
Liu, H., Zhang, X., Zhang, X., Cui, Y.: Self-adapted mixture distance measure for clustering uncertain data. Knowl.-Based Syst. 126, 33–47 (2017)
Muhammad, T., Halim, Z.: Employing artificial neural networks for constructing metadata-based model to automatically select an appropriate data visualization technique. Appl. Soft Comput. 49, 365–384 (2016)
Nasiri, S., Zenkert, J., Fathi, M.: Improving CBR adaptation for recommendation of associated references in a knowledge-based learning assistant system. Neurocomputing. 250, 5–17 (2017)
Ren, J., Lee, S.D., Chen, X., Kao, B., Cheng, R., Cheung, D.: Naive bayes classification of uncertain data. In: Ninth IEEE International Conference on Data Mining, 2009. ICDM’09, pp. 944–949 (2009)
Shen, J., Zhu, K.: An uncertain single machine scheduling problem with periodic maintenance. Knowl.-Based Syst. 144, 32–41 (2017)
Sistla, A.P., Wolfson, O., Chamberlain, S., Dao, S.: Querying the uncertain position of moving objects. In: Temporal databases: research and practice, pp. 310–337 (1998)
Stieglitz, S., Mirbabaie, M., Ross, B., Neuberger, C.: Social media analytics–Challenges in topic discovery, data collection, and data preparation. Int. J. Inf. Manag. 39, 156–168 (2018)
Sun, X., Lim, L., Wang, S.: An approximation algorithm of mining frequent itemsets from uncertain dataset. Int. J. Adv. Comput. Technol. 4(3), 42–49 (2012)
Swami, D., Sahoo, B.: Storage Size Estimation for Schemaless Big Data Applications: A JSON-based Overview. In: Intelligent Communication and Computational Technologies, pp. 315–323 (2018)
Tong, W., Leung, C.K., Liu, D., Yu, J.: Probabilistic frequent pattern mining by PUH-mine. In: Asia-Pacific Web Conference, pp. 768–780 (2015)
van Rijsbergen, C.J.: Information Retrieval. Butterworth (1979)
Wang, L., Cheung, D.W.L., Cheng, R., Lee, S.D., Yang, X.S.: Efficient mining of frequent item sets on large uncertain databases. IEEE Trans. Knowl. Data Eng. 24(12), 2170–2183 (2012)
Yang, J., Zhang, Y., Wei, Y.: An improved vertical algorithm for frequent itemset mining from uncertain database. In: Intelligent Human-Machine Systems and Cybernetics (IHMSC), vol. 1, pp. 355–358 (2017)
Zhang, Y., Qiu, M., Tsai, C.W., Hassan, M.M., Alamri, A.: Health-CPS: Healthcare cyber-physical system assisted by cloud and big data. IEEE Syst. J. 11(1), 88–95 (2017)

Acknowledgments

The authors wish to thank GIK Institute for providing research facilities. This work was sponsored by the GIK Institute graduate research fund under GA-1 scheme.

Author information

Authors and Affiliations

Department of Computer Science, National University of Computer and Emerging Sciences, Karachi, 74600, Pakistan
Ahsan Shah
The Machine Intelligence Research Group (MInG), Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi, 23460, Pakistan
Zahid Halim

Corresponding author

Correspondence to Zahid Halim.

About this article

Cite this article

Shah, A., Halim, Z. On Efficient Mining of Frequent Itemsets from Big Uncertain Databases. J Grid Computing 17, 831–850 (2019). https://doi.org/10.1007/s10723-018-9456-0

Received: 26 February 2018
Accepted: 01 August 2018
Published: 06 August 2018
Issue Date: December 2019
DOI: https://doi.org/10.1007/s10723-018-9456-0
