ABSTRACT
Social media provide continuous data streams containing information with different levels of sensitivity, validity and accuracy. This information therefore has to be properly filtered, extracted and processed to avoid noisy and inaccurate results. The main goal of this work is to propose an architecture and workflow able to process Twitter social network data in near real-time. The primary design of the introduced architecture covers all processing aspects, from data ingestion and storage to data processing and analysis. The paper presents an implementation based on Apache Spark and Hadoop. The secondary objective is to analyse tweets on a defined topic: floods. The word frequency method (Word Clouds) is shown as a major tool for analysing the content of the input dataset. The experimental architecture confirmed the usefulness of many well-known functions of Spark and Hadoop in the social data domain. The platforms used provided effective tools for data ingestion and storage as well as for processing and analysis. In the analytical part, it was observed that the word frequency method (n-grams) can effectively reveal the content of the tweets. According to the results of this study, the tweets proved to have high informative potential in terms of both data quality and quantity.
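The word frequency (n-gram) method mentioned above can be illustrated with a minimal sketch. It is plain Python rather than the Spark implementation the paper describes, and the function names (`tokenize`, `ngrams`, `ngram_counts`), the tokenization rules and the sample tweets are all hypothetical, chosen only to show the counting step.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a tweet, drop URLs and @mentions, and return word-like tokens.

    These cleaning rules are an assumption for illustration only.
    """
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    return re.findall(r"[a-z0-9']+", text)


def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tweets, n=2):
    """Count how often each n-gram occurs across a collection of tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(ngrams(tokenize(tweet), n))
    return counts


# Hypothetical sample data, not taken from the paper's dataset.
sample_tweets = [
    "Flood warning issued for the river valley",
    "Flood warning remains in place tonight @weather https://example.org/x",
    "Heavy rain and flood warning for the coast",
]

bigram_counts = ngram_counts(sample_tweets, n=2)
print(bigram_counts.most_common(3))
```

The most frequent n-grams are the kind of input a word cloud would be drawn from. On a distributed platform the same counting step is commonly expressed as a map over tweets followed by a per-key sum, although the abstract does not say how the authors implemented it.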
Index Terms
- Social Media Data Processing Infrastructure by Using Apache Spark Big Data Platform: Twitter Data Analysis