Abstract
Data preprocessing is a basic and important technique in data mining and machine learning. With the dramatic growth of information, traditional data preprocessing techniques have become time-consuming and ill-suited to processing massive data sets. To tackle this problem, we present parallel data preprocessing techniques based on MapReduce, a programming model that makes parallelization easy to implement. This paper gives the implementation details of these techniques, including data integration, data cleaning, and data normalization, among others. The proposed parallel techniques can process large-scale data (up to terabytes) efficiently. Our experimental results show considerable speedup as the number of processors increases.
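To make the idea of MapReduce-based normalization concrete, the following is a minimal sketch, not the implementation described in the paper. It assumes min-max normalization as the normalization method and simulates the map and reduce phases in a single Python process. The names run_mapreduce, minmax_mapper, minmax_reducer, and normalize_row are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy single-process stand-in for a MapReduce job (an assumption,
    not a real cluster): map every record, group values by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def minmax_mapper(row):
    """Emit (column index, (value, value)) so the reducer can track
    a running minimum and maximum per column."""
    for col, x in enumerate(row):
        yield col, (x, x)


def minmax_reducer(col, pairs):
    """Combine the (min, max) pairs of one column into a single pair."""
    return (min(lo for lo, _ in pairs), max(hi for _, hi in pairs))


def normalize_row(row, stats):
    """Map-only second pass: rescale each value to [0, 1] using the
    per-column statistics computed by the first job."""
    out = []
    for col, x in enumerate(row):
        lo, hi = stats[col]
        # A constant column has no range; map it to 0.0 by convention.
        out.append(0.0 if hi == lo else (x - lo) / (hi - lo))
    return out


rows = [[1.0, 10.0], [3.0, 20.0], [5.0, 40.0]]

# Job 1: per-column minimum and maximum.
stats = run_mapreduce(rows, minmax_mapper, minmax_reducer)

# Job 2: normalization needs no reduce step, only the shared statistics.
normalized = [normalize_row(row, stats) for row in rows]
```

Under these assumptions the work splits into two passes: the first job aggregates one small (min, max) pair per column, and the second pass rescales each row independently, so it can run on each data partition in parallel.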
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
He, Q., Tan, Q., Ma, X., Shi, Z. (2010). The High-Activity Parallel Implementation of Data Preprocessing Based on MapReduce. In: Yu, J., Greco, S., Lingras, P., Wang, G., Skowron, A. (eds) Rough Set and Knowledge Technology. RSKT 2010. Lecture Notes in Computer Science, vol 6401. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16248-0_88
DOI: https://doi.org/10.1007/978-3-642-16248-0_88
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16247-3
Online ISBN: 978-3-642-16248-0
eBook Packages: Computer Science, Computer Science (R0)