Abstract
A huge amount of sensing data is generated by a large number of pervasive IoT devices. To extract meaningful information from such big data, it is essential to perform pre-processing, in which many outlier data points need to be removed, because they deteriorate as time passes. Although pre-processing is essential in the big data field, research that includes case studies has been notably scarce. In this paper, big data pre-processing methods are investigated and proposed. To evaluate the pre-processing methods for accurate analysis, we used a collection of digital tachograph (DTG) data: DTG sensing data from 6198 driving vehicles, gathered over one year. We studied five kinds of pre-processing methods: filtering ranges, excluding meaningless values, comparing filters from variables, applying statistical techniques, and finding driving patterns. In addition, we developed a MapReduce program using the Hadoop ecosystem and deployed the data on it to perform the pre-processing analysis. Through the pre-processing steps, we confirmed that the proportion of DTG sensing data points containing any error was up to 27.09%. Compared with the traditional brute-force detection approach, ours achieved a 71.1% additional detection effect. We also confirmed that outlier data points that are difficult to detect through simple range-error pre-processing could be detected well.
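As an illustration of the first method listed above (filtering ranges), the following is a minimal, single-process sketch of how a map step and a reduce step could count range errors. It is not the authors' MapReduce program: the field names (`speed_kmh`, `rpm`, `longitude`, `latitude`), the thresholds in `VALID_RANGES`, and the helper names are all assumptions made for this example, since the abstract does not specify them.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical valid ranges per field. The abstract does not give the
# actual thresholds, so these values are illustrative assumptions only.
VALID_RANGES: Dict[str, Tuple[float, float]] = {
    "speed_kmh": (0.0, 200.0),
    "rpm": (0.0, 8000.0),
    "longitude": (124.0, 132.0),
    "latitude": (33.0, 39.0),
}


def map_record(record: Dict[str, float]) -> Iterator[Tuple[str, int]]:
    """Map step: emit ("total", 1) for every record, one pair per field
    that is missing or out of range, and ("error", 1) if any field failed."""
    yield ("total", 1)
    has_error = False
    for field, (low, high) in VALID_RANGES.items():
        value = record.get(field)
        if value is None or not (low <= value <= high):
            yield (f"range_error:{field}", 1)
            has_error = True
    if has_error:
        yield ("error", 1)


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step: sum the emitted counts for each key."""
    totals: Counter = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def error_proportion(records: Iterable[Dict[str, float]]) -> float:
    """Fraction of records with at least one range error (0.0 if empty)."""
    counts = reduce_counts(
        pair for record in records for pair in map_record(record)
    )
    total = counts.get("total", 0)
    return counts.get("error", 0) / total if total else 0.0
```

In a real Hadoop job the map and reduce functions would run on separate nodes with a shuffle between them; here they are chained in one process only to show the shape of the computation. A check like this catches only simple range errors, which is the baseline that the other four methods are meant to improve on.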








Acknowledgements
This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the IT/SW Creative research program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2013-H0502-13-1071).
About this article
Cite this article
Cho, W., Choi, E. Big data pre-processing methods with vehicle driving data using MapReduce techniques. J Supercomput 73, 3179–3195 (2017). https://doi.org/10.1007/s11227-017-2014-x
DOI: https://doi.org/10.1007/s11227-017-2014-x