Performance Evaluation of Apache Spark According to the Number of Nodes using Principal Component Analysis

Authors:

Sungjin Hong
Dept of Business Data Convergence, Chungbuk National University, Gaesin-dong, Seowon-gu, Cheongju-si, Chungcheongbuk-do, Korea

Sangho Kim
Dept of Business Data Convergence, Chungbuk National University, Gaesin-dong, Seowon-gu, Cheongju-si, Chungcheongbuk-do, Korea

Jongsun Jang
Dept of Business Data Convergence, Chungbuk National University, Gaesin-dong, Seowon-gu, Cheongju-si, Chungcheongbuk-do, Korea

Chi-hwan Choi
Department of Bio-Information Technology, Chungbuk National University, Gaesin-dong, Seowon-gu, Cheongju-si, Chungcheongbuk-do, Korea

In-sun Jung
Department of Management Information Systems, Chungbuk National University, Gaesin-dong, Seowon-gu, Cheongju-si, Chungcheongbuk-do, Korea

Jonghwa Na
Department of Information & Statistics, Chungbuk National University, Gaesin-dong, Seowon-gu, Cheongju-si, Chungcheongbuk-do, Korea

Wan-Sup Cho
Department of MIS/Business Data Convergence, Chungbuk National University, Gaesin-dong, Seowon-gu, Cheongju-si, Chungcheongbuk-do, Korea

Su-young Chi
Electronics and Telecommunications Research Institute, Chungbuk National University, Gaesin-dong, Seowon-gu, Cheongju-si, Chungcheongbuk-do, Korea, Yuseong-gu, Daejeon, Korea

BigDAS '15: Proceedings of the 2015 International Conference on Big Data Applications and Services, October 2015, Pages 98–103
https://doi.org/10.1145/2837060.2837074

Published: 20 October 2015

ABSTRACT

With advances in big data collection and storage technology, analysis that puts such data to use has recently expanded across the public sector and various industries. In the manufacturing and financial sectors especially, demand for real-time analysis of big data is very high. Existing studies of big data analysis have mainly assumed batch processing. In recent years, studies of real-time analytics using Spark, Storm, and IMDG (in-memory data grids) have been underway. This paper therefore evaluates the processing performance of principal component analysis (PCA) on open-source Spark, an in-memory distributed processing engine of the kind needed for real-time analysis and fast operations on large amounts of data. It shows how fast Spark is in comparison with open-source R, and it investigates the distributed processing capability of Spark according to the node configuration.
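As a rough illustration of why PCA lends itself to being spread across nodes, the sketch below computes a covariance matrix from per-partition summaries that can be merged in any order, then extracts the leading principal component by power iteration. It is not the authors' code and it does not use Spark or R; the function names (partition_stats, merge_stats, covariance, top_component) and the sample data are hypothetical, and the lists standing in for partitions only mimic how a cluster might split the rows.

```python
import math


def partition_stats(rows):
    """Summarize one partition: row count, column sums, sum of outer products."""
    d = len(rows[0])
    col_sums = [0.0] * d
    outer = [[0.0] * d for _ in range(d)]
    for r in rows:
        for i in range(d):
            col_sums[i] += r[i]
            for j in range(d):
                outer[i][j] += r[i] * r[j]
    return len(rows), col_sums, outer


def merge_stats(a, b):
    """Combine two partition summaries; the result equals a summary of all rows."""
    n = a[0] + b[0]
    sums = [x + y for x, y in zip(a[1], b[1])]
    outer = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[2], b[2])]
    return n, sums, outer


def covariance(stats):
    """Sample covariance matrix recovered from the merged summary."""
    n, sums, outer = stats
    d = len(sums)
    mean = [s / n for s in sums]
    return [
        [(outer[i][j] - n * mean[i] * mean[j]) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_component(cov, iters=200):
    """Leading eigenvalue and eigenvector of cov by power iteration.

    Assumes the uniform start vector is not orthogonal to the leading
    eigenvector and that cov is not the zero matrix.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cv = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * cv[i] for i in range(d))
    return eigenvalue, v


if __name__ == "__main__":
    # Hypothetical data: four points on the line y = 2x, split into two "partitions".
    rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
    merged = merge_stats(partition_stats(rows[:2]), partition_stats(rows[2:]))
    eigenvalue, direction = top_component(covariance(merged))
    print(eigenvalue, direction)
```

Because each summary depends only on its own rows, the expensive pass over the data can run independently on every partition, and only small d-by-d matrices need to be combined afterwards. How closely this matches what a given Spark version does internally is an assumption, not something the abstract states.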

Published in
BigDAS '15: Proceedings of the 2015 International Conference on Big Data Applications and Services
October 2015
321 pages
ISBN: 9781450338462
DOI: 10.1145/2837060
Conference Chairs: Jongsup Choi, Sun Hwa Han, Joo-Yeoun Lee, Taeho Park
Editor: Aziz Nasridinov
Program Chairs: Carson K. Leung, Yoo-Sung Kim, Young-Koo Lee
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Computing Performance
Principal Component Analysis (PCA)
R
Spark
Qualifiers
- research-article
- Research
- Refereed limited