Abstract
Data type classification is a significant problem in digital forensics and information security. In previous work, methods based on the support vector machine (SVM) have proven the most successful among the classification approaches studied. However, SVM training becomes notably computationally intensive as the number of training vectors grows. In this study, we propose a parallel distributed SVM (PDSVM) based on Hadoop MapReduce for scalable data type classification. First, the map phase determines support vectors (SVs) in each split of the dataset by running sequential minimal optimization (SMO). Then the reduce phase merges the SVs and computes the degree of global convergence. Finally, PDSVM uses the globally converged SVs to obtain the SVM model. The experimental results demonstrate that PDSVM can not only process large-scale training datasets but also perform well in terms of classification accuracy.
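The map, reduce, and final-model steps described in the abstract can be illustrated with a small single-process sketch. It is not the authors' Hadoop implementation, and several parts are assumptions made only for illustration: the local solver is a dual coordinate descent for a linear SVM standing in for SMO, the global SVs are fed back to every split on each round, and "global convergence" is taken to mean that the merged SV set stops changing. All function names and the toy data are hypothetical.

```python
def train_local(points, C=1.0, epochs=50):
    """Stand-in for SMO (assumption): dual coordinate descent, linear SVM.

    points is a list of (x, y) with x a tuple of floats and y in {-1, +1}.
    Returns (w, support_vectors); the last entry of w acts as the bias.
    """
    dim = len(points[0][0]) + 1          # +1 for a constant bias feature
    w = [0.0] * dim
    alpha = [0.0] * len(points)
    for _ in range(epochs):
        for i, (x, y) in enumerate(points):
            xb = x + (1.0,)
            q = sum(v * v for v in xb)
            grad = y * sum(wi * vi for wi, vi in zip(w, xb)) - 1.0
            new_alpha = min(max(alpha[i] - grad / q, 0.0), C)  # clip to [0, C]
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                w = [wi + delta * y * vi for wi, vi in zip(w, xb)]
                alpha[i] = new_alpha
    # Support vectors are the points with a non-zero multiplier.
    svs = [p for p, a in zip(points, alpha) if a > 1e-8]
    return w, svs


def map_phase(splits, global_svs, C):
    """Each 'mapper' finds the SVs of its split plus the current global SVs."""
    results = []
    for split in splits:
        data = list(dict.fromkeys(split + global_svs))  # order-preserving dedupe
        results.append(train_local(data, C)[1])
    return results


def reduce_phase(sv_lists):
    """The 'reducer' merges the per-split SVs into one deduplicated list."""
    return list(dict.fromkeys(p for svs in sv_lists for p in svs))


def pdsvm_sketch(splits, C=1.0, max_rounds=10):
    """Iterate map and reduce until the merged SV set stops changing."""
    global_svs = []
    for _ in range(max_rounds):
        merged = reduce_phase(map_phase(splits, global_svs, C))
        if set(merged) == set(global_svs):   # assumed convergence test
            break
        global_svs = merged
    w, _ = train_local(global_svs, C)        # final model from global SVs only
    return w, global_svs


def predict(w, x):
    score = sum(wi * vi for wi, vi in zip(w, x + (1.0,)))
    return 1 if score >= 0 else -1


# Toy, linearly separable data divided into two splits (illustrative only).
splits = [
    [((2.0, 2.0), 1), ((3.0, 2.5), 1), ((-2.0, -2.0), -1), ((-3.0, -2.5), -1)],
    [((2.5, 3.0), 1), ((4.0, 3.0), 1), ((-2.5, -3.0), -1), ((-4.0, -3.0), -1)],
]
w, svs = pdsvm_sketch(splits)
```

The point of the sketch is that the final call to `train_local` sees only the merged SVs rather than the full dataset, which is where the scalability claimed in the abstract would come from. In a real deployment each call inside `map_phase` would run as a separate Hadoop map task.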
Acknowledgments
This work is supported by the Natural Science Foundation of China under Grant Nos. 61070212 and 61572165, the State Key Program of Zhejiang Province Natural Science Foundation of China under Grant No. LZ15F020003, and the Key Lab of Information Network Security of Ministry of Public Security.
Copyright information
© 2017 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
About this paper
Cite this paper
Jiang, C., Wu, T., Xu, J., Zheng, N., Xu, M., Yang, T. (2017). A MapReduce-Based Distributed SVM for Scalable Data Type Classification. In: Wang, S., Zhou, A. (eds) Collaborate Computing: Networking, Applications and Worksharing. CollaborateCom 2016. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 201. Springer, Cham. https://doi.org/10.1007/978-3-319-59288-6_11
DOI: https://doi.org/10.1007/978-3-319-59288-6_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59287-9
Online ISBN: 978-3-319-59288-6
eBook Packages: Computer Science, Computer Science (R0)