ABSTRACT
Frequent itemset mining is one of the interesting applications of data mining. Data mining has recently received a great deal of attention because of the explosive growth in data and the economic and scientific need to turn such data into useful information. However, traditional frequent itemset mining algorithms cannot process large datasets efficiently on a single machine because of limits on computational power and memory. Current methods typically control execution time and output size by using higher minimum support thresholds, which yield fewer candidates and fewer frequent itemsets. In this paper, an improved version of the Apriori-like HFDM-EB algorithm is proposed that can handle lower minimum support thresholds when mining frequent itemsets over big transactional data. It runs on the Hadoop framework and uses compressed bitmaps. The experimental results show that the improved algorithm is efficient and scalable for mining frequent itemsets in big data.
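To make the role of the minimum support threshold and of bitmaps concrete, the following is a minimal single-machine sketch of Apriori-style mining with bitmap support counting. It is not the HFDM-EB algorithm or the improved version proposed in the paper: it assumes uncompressed bitmaps stored as plain Python integers, does not use Hadoop, and the function names `build_bitmaps` and `frequent_itemsets` are hypothetical names chosen for illustration.

```python
from itertools import combinations


def build_bitmaps(transactions):
    """Map each item to an int whose bit i is set when transaction i contains it."""
    bitmaps = {}
    for tid, items in enumerate(transactions):
        for item in items:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)
    return bitmaps


def support(bits):
    """Support count = number of set bits in the bitmap."""
    return bin(bits).count("1")


def frequent_itemsets(transactions, min_support):
    """Return {sorted item tuple: support count} for every frequent itemset.

    Illustrative assumption: min_support is an absolute transaction count.
    """
    item_bits = build_bitmaps(transactions)
    # Level 1: keep only items that meet the threshold.
    current = {(item,): bits for item, bits in item_bits.items()
               if support(bits) >= min_support}
    result = {}
    while current:
        result.update({itemset: support(bits) for itemset, bits in current.items()})
        next_level = {}
        for a, b in combinations(sorted(current), 2):
            # Join step: merge two k-itemsets that share their first k-1 items.
            if a[:-1] != b[:-1]:
                continue
            candidate = a + (b[-1],)
            # Prune step: every k-subset of the candidate must itself be frequent.
            if any(sub not in current
                   for sub in combinations(candidate, len(candidate) - 1)):
                continue
            # Support of the candidate is the popcount of the AND of two bitmaps.
            bits = current[a] & current[b]
            if support(bits) >= min_support:
                next_level[candidate] = bits
        current = next_level
    return result


transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
high = frequent_itemsets(transactions, min_support=3)
low = frequent_itemsets(transactions, min_support=2)
```

On this toy data the higher threshold admits six itemsets, while the lower one also admits `("a", "b", "c")`, which shows in miniature why lowering the threshold enlarges the candidate and output sets.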