Abstract
Big data application performance can be optimized by identifying the most impactful system parameters of big data platforms. This paper focuses on identifying the optimal system parameter set for the Hadoop and Spark data platforms by applying different feature selection techniques. The main objective of the research is to reduce job execution time by tuning only these identified parameters. Parameters deemed less relevant or redundant are eliminated during the feature selection process. The parameter sets identified by the different feature selection algorithms are compared, and an empirical analysis is carried out. Statistical analysis is used as a cross-validation technique to evaluate the relevance of the identified parameter set and the dependency of platform performance on the system parameters.
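To make the idea of parameter selection concrete, the following is a minimal sketch of one simple filter-style approach: ranking parameters by the absolute Pearson correlation between their values and the job execution time, and keeping those above a threshold. This is not the paper's method, which is not specified in the abstract. The parameter names, the threshold of 0.5, and the four runs are all hypothetical, synthetic values chosen for illustration only.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no information
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sx * sy)


def rank_parameters(runs, target, threshold=0.5):
    """Return (name, score) pairs, sorted by |correlation| with the target,
    keeping only parameters whose score reaches the threshold."""
    ys = [run[target] for run in runs]
    names = [key for key in runs[0] if key != target]
    scores = {n: abs(pearson([run[n] for run in runs], ys)) for n in names}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, score) for name, score in ranked if score >= threshold]


# Synthetic toy runs (hypothetical values, NOT measurements from the paper).
runs = [
    {"executor_memory_gb": 2, "executor_cores": 1, "compress_flag": 1, "exec_time_s": 120},
    {"executor_memory_gb": 4, "executor_cores": 2, "compress_flag": 2, "exec_time_s": 95},
    {"executor_memory_gb": 6, "executor_cores": 2, "compress_flag": 2, "exec_time_s": 80},
    {"executor_memory_gb": 8, "executor_cores": 4, "compress_flag": 1, "exec_time_s": 60},
]

selected = rank_parameters(runs, target="exec_time_s")
for name, score in selected:
    print(f"{name}: |r| = {score:.3f}")
```

In this toy data the two resource parameters are strongly correlated with execution time and survive the filter, while the flag is dropped. A correlation filter only captures linear, one-at-a-time relationships and does not remove redundant parameters on its own, which is one reason to compare several feature selection techniques, as the abstract describes.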
Copyright information
© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Pattanshetti, T., Attar, V. (2021). Performance Optimization of Big Data Applications Using Parameter Tuning of Data Platform Features Through Feature Selection Techniques. In: Bhateja, V., Peng, SL., Satapathy, S.C., Zhang, YD. (eds) Evolution in Computational Intelligence. Advances in Intelligent Systems and Computing, vol 1176. Springer, Singapore. https://doi.org/10.1007/978-981-15-5788-0_26
DOI: https://doi.org/10.1007/978-981-15-5788-0_26
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-5787-3
Online ISBN: 978-981-15-5788-0
eBook Packages: Intelligent Technologies and Robotics (R0)