Abstract
The Hadoop framework is a reliable, scalable platform for big data analytics. In this paper we investigate Hadoop for distributed data mining, with the aim of reducing the computational cost of processing exponentially growing scientific data. We use the RIPPER (Repeated Incremental Pruning to Produce Error Reduction) algorithm [5] to develop a rule-based classifier, and we propose a parallel implementation of RIPPER based on the Hadoop MapReduce framework. The data is horizontally partitioned so that each node operates on a portion of the dataset, and the per-node results are then aggregated to build the classifier. We tested our algorithm on two large datasets, and the results show a speedup of up to 3.7 on 4 nodes.
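The partition-then-aggregate scheme described above can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the toy row schema, the threshold learner `learn_rule` (a stand-in that is deliberately not RIPPER), and the averaging step in `aggregate` are all assumptions made for illustration, since the abstract does not specify how the per-node rules are combined. The last helper only works through the arithmetic behind the reported speedup.

```python
from typing import List, Sequence, Tuple

Row = Tuple[float, int]   # (feature value, class label): hypothetical toy schema
Rule = Tuple[float, int]  # (threshold, label): "if feature >= threshold, predict label"


def horizontal_partition(rows: Sequence[Row], n_nodes: int) -> List[List[Row]]:
    """Split the dataset by rows (not columns) into n_nodes chunks."""
    return [list(rows[i::n_nodes]) for i in range(n_nodes)]


def learn_rule(part: Sequence[Row]) -> Rule:
    """Stand-in for the per-node learner (assumption: NOT RIPPER).

    Places one threshold halfway between the two class means.
    Assumes the partition contains at least one row of each class.
    """
    pos = [x for x, y in part if y == 1]
    neg = [x for x, y in part if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return (threshold, 1)


def aggregate(rules: Sequence[Rule]) -> Rule:
    """Combine per-node rules (assumption: average the thresholds)."""
    return (sum(t for t, _ in rules) / len(rules), rules[0][1])


def parallel_efficiency(speedup: float, n_nodes: int) -> float:
    """Efficiency = speedup / number of nodes."""
    return speedup / n_nodes


# "Map" phase: each node learns from its own horizontal partition.
rows = [(float(i), 1 if i >= 50 else 0) for i in range(100)]
partitions = horizontal_partition(rows, n_nodes=4)
local_rules = [learn_rule(p) for p in partitions]

# "Reduce" phase: merge the per-node rules into one classifier.
final_rule = aggregate(local_rules)
```

Under these assumptions, a speedup of 3.7 on 4 nodes corresponds to a parallel efficiency of 3.7 / 4 = 0.925, that is, about 92.5% of ideal linear scaling.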
References
Apache hadoop, http://hadoop.apache.org/
Sloan Digital Sky Survey Data Release 10, http://skyserver.sdss3.org/dr10/en/home.aspx
Basu, S., Kumaravel, A.: Classification by rules mining model with map-reduce framework in cloud. International Journal of Advanced and Innovative Research 2, 403–409 (2013)
Borthakur, D.: The hadoop distributed file system: Architecture and design. Hadoop Project Website (2007)
Cohen, W.W.: Fast effective rule induction. In: Proceedings of the 12th International Conference on Machine Learning (ICML 1995), pp. 115–123 (1995)
Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51, 107–113 (2008)
Dean, J., Ghemawat, S.: MapReduce: A flexible data processing tool. Communications of the ACM 53(1), 72–77 (2010)
Ishibuchi, H., Yamane, M., Nojima, Y.: Ensemble fuzzy rule-based classifier design by parallel distributed fuzzy gbml algorithms. In: Bui, L.T., Ong, Y.S., Hoai, N.X., Ishibuchi, H., Suganthan, P.N. (eds.) SEAL 2012. LNCS, vol. 7673, pp. 93–103. Springer, Heidelberg (2012)
Mackey, G., Sehrish, S., Bent, J., Lopez, J., Habib, S., Wang, J.: Introducing map-reduce to high end computing. In: 3rd Petascale Data Storage Workshop (PDSW 2008), pp. 1–6 (2008)
Nguyen, T.-C., Shen, W.-F., Chai, Y.-H., Xu, W.-M.: Research and implementation of scalable parallel computing based on map-reduce. Journal of Shanghai University (English Edition) 15(5), 426–429 (2011)
Qin, B., Xia, Y., Prabhakar, S., Tu, Y.-C.: A rule-based classification algorithm for uncertain data. In: Ioannidis, Y.E., Lee, D.L., Ng, R.T. (eds.) ICDE, pp. 1633–1640. IEEE (2009)
Zhou, L., Wang, H., Wang, W.: Parallel implementation of classification algorithms based on cloud computing environment. Indonesian Journal of Electrical Engineering 10(5), 1087–1092 (2012)
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Gugnani, S., Khanolkar, D., Bihany, T., Khadilkar, N. (2014). Rule Based Classification on a Multi Node Scalable Hadoop Cluster. In: Fortino, G., Di Fatta, G., Li, W., Ochoa, S., Cuzzocrea, A., Pathan, M. (eds) Internet and Distributed Computing Systems. IDCS 2014. Lecture Notes in Computer Science, vol 8729. Springer, Cham. https://doi.org/10.1007/978-3-319-11692-1_15
DOI: https://doi.org/10.1007/978-3-319-11692-1_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11691-4
Online ISBN: 978-3-319-11692-1
eBook Packages: Computer Science, Computer Science (R0)