A MapReduce-Based Distributed SVM for Scalable Data Type Classification

  • Conference paper
Collaborate Computing: Networking, Applications and Worksharing (CollaborateCom 2016)

Abstract

Data type classification is a significant problem in digital forensics and information security. In previous work, methods based on support vector machines (SVMs) have proven the most successful among the classification approaches studied. However, SVM training becomes notably computationally intensive as the number of training vectors grows. In this study, we propose a parallel distributed SVM (PDSVM) based on Hadoop MapReduce for scalable data type classification. First, the map phase determines the support vectors (SVs) in each split of the dataset by running sequential minimal optimization (SMO). Then, the reduce phase merges the SVs and computes the degree of global convergence. Finally, PDSVM uses the globally converged SVs to obtain the SVM model. The experimental results demonstrate that PDSVM can not only process large-scale training datasets but also performs well in terms of classification accuracy.
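
The map/reduce loop described in the abstract can be illustrated with a minimal single-process sketch in Python. It is not the authors' Hadoop implementation, and several parts are assumptions made for illustration: the per-split trainer uses dual coordinate descent for a linear SVM as a simple stand-in for SMO, the convergence test (the merged SV set stops changing between rounds) is a guess because the abstract does not define the degree of global convergence, and all function names (`train_linear_svm`, `map_phase`, `reduce_phase`, `pdsvm_train`, `predict`) and the toy data are hypothetical.

```python
from typing import List, Set, Tuple

Sample = Tuple[Tuple[float, ...], int]  # (feature vector, label in {-1, +1})


def train_linear_svm(data: List[Sample], C: float = 10.0, epochs: int = 200):
    """Dual coordinate descent for a linear SVM (assumed stand-in for SMO)."""
    w = [0.0] * (len(data[0][0]) + 1)  # last weight plays the role of the bias
    alpha = [0.0] * len(data)
    for _ in range(epochs):
        for i, (x, y) in enumerate(data):
            xb = x + (1.0,)  # append a constant bias feature
            grad = y * sum(wj * v for wj, v in zip(w, xb)) - 1.0
            new_a = min(max(alpha[i] - grad / sum(v * v for v in xb), 0.0), C)
            delta = new_a - alpha[i]
            alpha[i] = new_a
            w = [wj + delta * y * v for wj, v in zip(w, xb)]
    # Support vectors are the samples with a non-zero dual coefficient.
    svs = {s for s, a in zip(data, alpha) if a > 1e-8}
    return w, svs


def map_phase(split: List[Sample], global_svs: Set[Sample]) -> Set[Sample]:
    """Train on one split plus the current global SVs; emit the local SVs."""
    _, local_svs = train_linear_svm(sorted(set(split) | global_svs))
    return local_svs


def reduce_phase(sv_sets: List[Set[Sample]]) -> Set[Sample]:
    """Merge the SVs emitted by all mappers."""
    return set().union(*sv_sets)


def pdsvm_train(splits: List[List[Sample]], max_rounds: int = 10):
    global_svs: Set[Sample] = set()
    for _ in range(max_rounds):
        merged = reduce_phase([map_phase(s, global_svs) for s in splits])
        if merged == global_svs:  # assumed convergence test: SV set is stable
            break
        global_svs = merged
    w, _ = train_linear_svm(sorted(global_svs))  # final model from global SVs
    return w, global_svs


def predict(w: List[float], x: Tuple[float, ...]) -> int:
    score = sum(wj * v for wj, v in zip(w, x + (1.0,)))
    return 1 if score >= 0 else -1


# Toy data: two splits, each containing both classes.
split_a = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((-1.0, -1.0), -1)]
split_b = [((2.0, 3.0), 1), ((3.0, 2.0), 1), ((-1.0, 0.0), -1), ((0.0, -1.0), -1)]
w, svs = pdsvm_train([split_a, split_b])
```

In a real Hadoop job each call to `map_phase` would run as a separate mapper over an HDFS split, and the merged SV set would be passed back to the mappers for the next round; here the list comprehension stands in for that distribution step.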

Acknowledgments

This work is supported by the Natural Science Foundation of China under Grant Nos. 61070212 and 61572165, the State Key Program of Zhejiang Province Natural Science Foundation of China under Grant No. LZ15F020003, and the Key Lab of Information Network Security of the Ministry of Public Security.

Author information

Corresponding author

Correspondence to Tao Yang.

Copyright information

© 2017 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Jiang, C., Wu, T., Xu, J., Zheng, N., Xu, M., Yang, T. (2017). A MapReduce-Based Distributed SVM for Scalable Data Type Classification. In: Wang, S., Zhou, A. (eds) Collaborate Computing: Networking, Applications and Worksharing. CollaborateCom 2016. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 201. Springer, Cham. https://doi.org/10.1007/978-3-319-59288-6_11

  • DOI: https://doi.org/10.1007/978-3-319-59288-6_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-59287-9

  • Online ISBN: 978-3-319-59288-6

  • eBook Packages: Computer Science, Computer Science (R0)
