The High-Activity Parallel Implementation of Data Preprocessing Based on MapReduce

He, Qing; Tan, Qing; Ma, Xudong; Shi, Zhongzhi

doi:10.1007/978-3-642-16248-0_88


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6401)

Included in the following conference series:

International Conference on Rough Sets and Knowledge Technology


Abstract

Data preprocessing is an important and basic technique for data mining and machine learning. Because the volume of information has grown dramatically, traditional data preprocessing techniques are time-consuming and ill-suited to processing massive data sets. To tackle this problem, we present parallel data preprocessing techniques based on MapReduce, a programming model that makes parallelization easy to implement. This paper gives the implementation details of these techniques, including data integration, data cleaning, and data normalization, among others. The proposed parallel techniques can deal with large-scale data (up to terabytes) efficiently. Our experimental results show considerable speedup as the number of processors increases.
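
To illustrate how a preprocessing step such as data normalization can be phrased in the MapReduce model, the following is a minimal sketch, not the authors' implementation (the preview does not include their algorithms). It assumes min-max normalization over numeric columns, and it replaces a real cluster with a small in-memory stand-in; the names `run_map_reduce`, `map_min_max`, `reduce_min_max`, and `normalize` are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def map_min_max(record):
    """Map: emit (column index, (value, value)) for each numeric field."""
    for col, value in enumerate(record):
        yield col, (value, value)


def reduce_min_max(a, b):
    """Reduce: merge two (min, max) pairs that belong to the same column."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def run_map_reduce(records, mapper, reducer):
    """In-memory stand-in (assumption) for the shuffle and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reduce(reducer, values) for key, values in groups.items()}


def normalize(record, stats):
    """Map-only second pass: rescale each field to the range [0, 1]."""
    out = []
    for col, value in enumerate(record):
        lo, hi = stats[col]
        # A constant column has no spread, so map it to 0.0.
        out.append(0.0 if hi == lo else (value - lo) / (hi - lo))
    return out


records = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
stats = run_map_reduce(records, map_min_max, reduce_min_max)
normalized = [normalize(r, stats) for r in records]
```

Because `reduce_min_max` is associative and commutative, partial minima and maxima could be combined on each node before the shuffle, and the second pass needs no reducer at all, since each record is rescaled independently once the per-column statistics are known.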



Author information

Authors and Affiliations

The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Qing He, Qing Tan, Xudong Ma & Zhongzhi Shi
Graduate University of Chinese Academy of Sciences, Beijing, 100190, China
Qing Tan & Xudong Ma


Editor information

Editors and Affiliations

School of Computer and Information Technology, Beijing Jiaotong University, 100044, Beijing, China
Jian Yu
Faculty of Economics, University of Catania, Corso Italia, 55, 95129, Catania, Italy
Salvatore Greco
Department of Mathematics and Computing Science, Saint Mary’s University, B3H 3C3, Halifax, Nova Scotia, Canada
Pawan Lingras
Institute of Computer Science and Technology, Chongqing University of Posts and Telecommunications, 400065, Chongqing, China
Guoyin Wang
Institute of Mathematics, Warsaw University, Banacha 2, 02-097, Warsaw, Poland
Andrzej Skowron


Cite this paper

He, Q., Tan, Q., Ma, X., Shi, Z. (2010). The High-Activity Parallel Implementation of Data Preprocessing Based on MapReduce. In: Yu, J., Greco, S., Lingras, P., Wang, G., Skowron, A. (eds) Rough Set and Knowledge Technology. RSKT 2010. Lecture Notes in Computer Science, vol 6401. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16248-0_88


DOI: https://doi.org/10.1007/978-3-642-16248-0_88
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16247-3
Online ISBN: 978-3-642-16248-0
eBook Packages: Computer Science, Computer Science (R0)
