Abstract
In the current era of information, communication, and technology, data is being generated at an exponential rate. This gives machine learning and data mining algorithms an opportunity to learn from huge data repositories. At the same time, however, big data poses many challenges, and data uncertainty is the key concern for modern data mining systems. This work addresses the problem of extracting frequent itemsets from such large uncertain databases to assist decision makers in understanding non-trivial data trends. The usual technique for finding frequent itemsets in uncertain databases is known as Possible World Semantics (PWS). However, as the database size increases, PWS suffers from performance issues. Therefore, there is a need for efficient frequent pattern mining algorithms. This work presents three techniques to address this issue: a 3D linked array-based strategy, a connected tree technique, and an average probability-based setup supported by a tree data structure. The objective is to minimize computational cost by traversing the database only once. The 3D linked array-based solution scans the database only once and stores the support information of each item and its association with other items within the 3D array. In the tree-based method, a 1D array is associated with each node of the tree, comprising the support information of the database items and their associations with other items. The average probability-based approach computes the average probability factor and uses it to map the uncertain database to a tree. The proposal addresses attribute uncertainty as well as tuple uncertainty when mapping large uncertain databases to the proposed data structures. In addition to introducing the three data structures, this work also presents algorithms to extract frequent itemsets.
The proposal is compared with four recent works in this domain for uncertain data, namely the mining threshold-based (MB) technique, frequent itemsets using nodesets (FIN), PrePost+, and uncertain Apriori (UApriori). Experiments are performed on four benchmark datasets. The results suggest better performance for the three techniques presented here, which consume 60% less execution time.
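To make the cost of PWS concrete, the following is a minimal sketch, assuming attribute-level uncertainty in which each item occurrence carries an independent existential probability. It is not one of the data structures proposed in this work; the toy database `DB` and both function names are hypothetical and chosen only for illustration. The first function enumerates every possible world, which grows exponentially with the number of uncertain item occurrences, while the second computes the same expected support in a single pass over the database.

```python
from itertools import product
from math import prod

# Hypothetical toy database (an assumption for illustration):
# each transaction maps an item to its existential probability.
DB = [
    {"a": 0.9, "b": 0.5},
    {"a": 0.6, "c": 0.8},
    {"a": 0.5, "b": 0.4, "c": 1.0},
]


def expected_support_pws(db, itemset):
    """Expected support by enumerating all possible worlds (exponential)."""
    itemset = set(itemset)
    # One slot per uncertain item occurrence: (transaction id, item, probability).
    slots = [(tid, item, p) for tid, t in enumerate(db) for item, p in t.items()]
    total = 0.0
    for present in product([False, True], repeat=len(slots)):
        # Probability of this world under the independence assumption.
        world_prob = prod(
            p if here else 1.0 - p for (_, _, p), here in zip(slots, present)
        )
        if world_prob == 0.0:
            continue
        # Materialise the certain transactions of this world.
        world = [set() for _ in db]
        for (tid, item, _), here in zip(slots, present):
            if here:
                world[tid].add(item)
        support = sum(1 for t in world if itemset <= t)
        total += world_prob * support
    return total


def expected_support_direct(db, itemset):
    """Expected support in one database pass: sum of per-transaction products."""
    itemset = set(itemset)
    return sum(
        prod(t[i] for i in itemset)
        for t in db
        if all(i in t for i in itemset)
    )


if __name__ == "__main__":
    for candidate in ({"a"}, {"a", "b"}, {"a", "c"}):
        print(sorted(candidate),
              round(expected_support_pws(DB, candidate), 6),
              round(expected_support_direct(DB, candidate), 6))
```

Even on this three-transaction example the enumeration visits 2^7 = 128 worlds, whereas the direct computation touches each transaction once. This gap is the kind of overhead that single-scan approaches aim to avoid.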
Acknowledgments
The authors wish to thank GIK Institute for providing research facilities. This work was sponsored by the GIK Institute graduate research fund under the GA-1 scheme.
Cite this article
Shah, A., Halim, Z. On Efficient Mining of Frequent Itemsets from Big Uncertain Databases. J Grid Computing 17, 831–850 (2019). https://doi.org/10.1007/s10723-018-9456-0