Abstract
Data analysis depends closely on the attribute dimensions of the data. Traditional extraction and partitioning of attribute dimensions is largely manual and too inefficient to meet the needs of big data analysis. This paper proposes an attribute dimension partition scheme for big data analysis based on SVM classification and MapReduce. The scheme improves the traditional SVM classification method by combining it with Euclidean distance theory to overcome its disadvantages, and adopts a penalty coefficient to reduce the imbalance of the data distribution. With the improved SVM classification method, attribute dimension partition is implemented using the MapReduce model of Hadoop as the processing engine, TF–IDF vectors to store the extracted attribute dimensions, and the k-means clustering algorithm to perform the clustering partition. The experimental results show that the execution efficiency of the proposed method is improved and that, while the rationality of the partition is guaranteed, an increase in the number of data attributes does not significantly increase the execution time.
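The last two stages named in the abstract, TF–IDF vectors followed by k-means clustering, can be illustrated with a minimal single-process sketch. This is not the authors' implementation: it does not run on Hadoop, it omits the improved SVM step entirely, and the toy records, the function names (`tf_idf`, `k_means`, `sq_dist`), and the deterministic centroid initialisation are all assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return (vocabulary, vectors): one TF-IDF vector per tokenised record."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    # Document frequency: number of records containing each term. In a
    # MapReduce setting this count is the natural map/reduce step.
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([(tf[t] / len(d)) * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(vectors, k, iterations=20):
    """Plain k-means; returns the cluster label of each vector."""
    # Assumption: the first k vectors seed the centroids, which keeps the
    # sketch reproducible (real implementations usually seed randomly).
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical records, each a list of extracted attribute names.
records = [
    ["price", "cost", "amount"],
    ["city", "region", "country"],
    ["price", "amount", "discount"],
    ["region", "country", "zip"],
]

vocab, vectors = tf_idf(records)
labels = k_means(vectors, k=2)
print(labels)  # → [0, 1, 0, 1]
```

On this toy input the monetary records and the geographic records fall into separate clusters, which is the kind of grouping a dimension partition is meant to produce.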









Acknowledgements
The authors acknowledge the National Natural Science Foundation of China (Grant No. 61373160), the Standardization Processing and Application System Development of Science and Technology’s Big Data (Grant No. 17210113D), and Science and Technology Resource Survey, Statistical Analysis and System Development (Grant No. 179676334D).
Cite this article
Zhao, W., Fan, T., Nie, Y. et al. Research on Attribute Dimension Partition Based on SVM Classifying and MapReduce. Wireless Pers Commun 102, 2759–2774 (2018). https://doi.org/10.1007/s11277-018-5301-9