On Efficient Mining of Frequent Itemsets from Big Uncertain Databases

Journal of Grid Computing

Abstract

In the current era of information, communication, and technology, data is being generated at an exponential rate. This gives machine learning and data mining algorithms an opportunity to learn from huge data repositories. At the same time, big data poses many challenges, and data uncertainty is a key concern for modern data mining systems. This work addresses the problem of extracting frequent itemsets from such large uncertain databases to assist decision makers in understanding non-trivial data trends. The usual technique for finding frequent itemsets in uncertain databases is known as Possible World Semantics (PWS). However, as the database size increases, PWS suffers from performance issues, so there is a need for efficient frequent pattern mining algorithms. This work presents three techniques to address this issue: a 3D linked array-based strategy, a connected tree technique, and an average probability-based setup supported by a tree data structure. The objective is to minimize computational cost by traversing the database only once. The 3D linked array-based solution scans the database once and stores, within the 3D array, the support information of each item and its associations with other items. In the tree-based method, a 1D array is associated with each node of the tree and holds the support information of the database items and their associations with other items. The average probability-based approach computes an average probability factor and uses it to map the uncertain database to a tree. The proposal addresses both attribute uncertainty and tuple uncertainty when mapping large uncertain databases to the proposed data structures. In addition to introducing the three data structures, this work also presents algorithms to extract frequent itemsets.
The proposal is compared with four recent works in this domain for uncertain data, namely, the mining threshold-based (MB) technique, frequent itemsets using nodesets (FIN), PrePost+, and uncertain Apriori (UApriori). Experiments are performed on four benchmark datasets. The results suggest that the three techniques presented here perform better than the compared methods, while consuming 60% less execution time.
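
To make the single-scan idea more concrete, the following is a minimal Python sketch and not the authors' 3D linked array or tree structures. It assumes attribute-level uncertainty in which every item in a transaction carries an independent existence probability, so the expected support of an itemset is the sum, over transactions, of the product of its item probabilities. The toy database `DB` and the function names `expected_supports_one_pass`, `expected_support_pws`, and `frequent_itemsets` are hypothetical and chosen only for illustration. The brute-force function enumerates every possible world, which is why a direct PWS computation becomes impractical as the database grows.

```python
from collections import defaultdict
from itertools import combinations, product
from math import prod

# Hypothetical toy uncertain database (an assumption for illustration):
# each transaction maps an item to its existence probability.
DB = [
    {"a": 0.9, "b": 0.5},
    {"a": 0.6, "c": 0.8},
    {"a": 0.5, "b": 0.4, "c": 1.0},
]


def expected_supports_one_pass(db, max_size=2):
    """Scan the database once and accumulate the expected support of every
    itemset of size <= max_size (items and their pairwise associations)."""
    support = defaultdict(float)
    for transaction in db:
        items = sorted(transaction)
        for k in range(1, max_size + 1):
            for itemset in combinations(items, k):
                # Independence assumption: multiply the item probabilities.
                support[itemset] += prod(transaction[i] for i in itemset)
    return dict(support)


def expected_support_pws(db, itemset):
    """Brute-force possible-world computation: enumerate every world, count
    the transactions containing the itemset, and weight by world probability."""
    slots = [(t, item, p) for t, tx in enumerate(db) for item, p in tx.items()]
    wanted = set(itemset)
    total = 0.0
    for world in product([True, False], repeat=len(slots)):
        world_prob = 1.0
        present = defaultdict(set)
        for keep, (t, item, p) in zip(world, slots):
            world_prob *= p if keep else 1.0 - p
            if keep:
                present[t].add(item)
        count = sum(1 for t in range(len(db)) if wanted <= present[t])
        total += world_prob * count
    return total


def frequent_itemsets(db, min_expected_support, max_size=2):
    """Keep the itemsets whose expected support reaches the threshold."""
    supports = expected_supports_one_pass(db, max_size)
    return {s: v for s, v in supports.items() if v >= min_expected_support}


if __name__ == "__main__":
    for itemset, value in sorted(expected_supports_one_pass(DB).items()):
        print(itemset, round(value, 4), round(expected_support_pws(DB, itemset), 4))
```

On this toy input the one-pass accumulation and the possible-world enumeration agree for every itemset, but the enumeration visits 2^7 worlds for only seven item occurrences, whereas the one-pass version touches each transaction once.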

Acknowledgments

The authors wish to thank GIK Institute for providing research facilities. This work was sponsored by the GIK Institute graduate research fund under GA-1 scheme.

Author information

Corresponding author

Correspondence to Zahid Halim.

About this article

Cite this article

Shah, A., Halim, Z. On Efficient Mining of Frequent Itemsets from Big Uncertain Databases. J Grid Computing 17, 831–850 (2019). https://doi.org/10.1007/s10723-018-9456-0

  • DOI: https://doi.org/10.1007/s10723-018-9456-0
