DOI: 10.1145/2908446.2908457
research-article

Compressed Bitmaps Based Frequent Itemsets Mining on Hadoop

Published: 09 May 2016

ABSTRACT

Frequent itemsets mining is one of the interesting applications of data mining. Recently, data mining has received a great deal of attention due to the explosive growth in data and the economic and scientific need to turn such data into useful information. However, traditional frequent itemsets mining algorithms have become inefficient on large datasets when run on a single machine, because of limits on computational power and memory. Current methods tend to control execution time and output size by using higher minimum support thresholds, which lead to fewer candidates and fewer frequent itemsets. In this paper, an improved version of the Apriori-like HFDM-EB algorithm that can deal with lower minimum support thresholds is proposed for mining frequent itemsets over big transactional data on the Hadoop framework by utilizing compressed bitmaps. The experimental results show that the improved algorithm is efficient and scalable for mining frequent itemsets in big data.
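
To make the idea of bitmap-based support counting concrete, the following is a minimal single-machine sketch of an Apriori-style level-wise search in Python. It is not the HFDM-EB algorithm or its improved version, which the abstract does not specify in detail. It assumes each transaction is packed into a plain, uncompressed integer bitmap (one bit per item) and leaves out both Hadoop and bitmap compression. The function names `encode`, `support`, and `frequent_itemsets` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def encode(itemset, item_index):
    """Pack an iterable of items into an integer bitmap (one bit per item)."""
    bits = 0
    for item in itemset:
        bits |= 1 << item_index[item]
    return bits


def support(candidate_bits, transaction_bitmaps):
    """Count transactions whose bitmap contains every bit of the candidate."""
    return sum(
        1 for t in transaction_bitmaps if (t & candidate_bits) == candidate_bits
    )


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search; returns {frozenset: support count}.

    min_support is an absolute count here (an assumption of this sketch).
    """
    items = sorted({item for t in transactions for item in t})
    item_index = {item: i for i, item in enumerate(items)}
    bitmaps = [encode(t, item_index) for t in transactions]

    result = {}
    current = []
    # Level 1: frequent single items.
    for item in items:
        count = support(1 << item_index[item], bitmaps)
        if count >= min_support:
            single = frozenset([item])
            current.append(single)
            result[single] = count

    k = 2
    while current:
        # Join step: union pairs of frequent (k-1)-itemsets that form a k-itemset.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        current = []
        for cand in candidates:
            # Prune step: every (k-1)-subset must already be frequent.
            if any(frozenset(s) not in result for s in combinations(cand, k - 1)):
                continue
            count = support(encode(cand, item_index), bitmaps)
            if count >= min_support:
                current.append(cand)
                result[cand] = count
        k += 1
    return result
```

Lowering `min_support` in this sketch increases the number of candidates generated at each level, which is the cost that the abstract says current methods avoid by raising the threshold.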

  • Published in

    ACM Other conferences
    INFOS '16: Proceedings of the 10th International Conference on Informatics and Systems
    May 2016
    347 pages
    ISBN:9781450340625
    DOI:10.1145/2908446

    Copyright © 2016 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited
