research-article

Social Media Data Processing Infrastructure by Using Apache Spark Big Data Platform: Twitter Data Analysis

Authors:

Michal Podhoranyi,

Lukas Vojacek

CCIOT '19: Proceedings of the 2019 4th International Conference on Cloud Computing and Internet of Things

Pages 1 - 6

https://doi.org/10.1145/3361821.3361825

Published: 20 September 2019

Abstract

Social media provide continuous data streams that contain information with different levels of sensitivity, validity and accuracy. This type of information therefore has to be properly filtered, extracted and processed to avoid noisy and inaccurate results. The main goal of this work is to propose an architecture and workflow able to process Twitter social network data in near real-time. The design of the introduced architecture covers all processing stages, from data ingestion and storage to data processing and analysis. This paper presents an Apache Spark and Hadoop implementation. The secondary objective is to analyse tweets on a defined topic: floods. The word frequency method (Word Clouds) is shown as a major tool for analysing the content of the input dataset. The experimental architecture confirmed the usefulness of many well-known functions of Spark and Hadoop in the social data domain. The platforms used provided effective tools for data ingestion and storage as well as for processing and analysis. Based on the analytical part, it was observed that the word frequency method (n-grams) can effectively reveal the content of tweets. According to the results of this study, the tweets proved their high informative potential regarding data quality and quantity.
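
As an illustration of the word frequency (n-gram) method mentioned in the abstract, the following is a minimal sketch in plain Python. It is not the authors' Spark and Hadoop implementation, which is not described in detail here. The stop-word list, the sample tweets, and the function names `tokenize` and `ngram_counts` are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "in", "of", "to", "is", "and", "on", "for", "at"}


def tokenize(tweet):
    """Lowercase a tweet, strip URLs and @mentions, and drop stop words."""
    text = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    return [w for w in re.findall(r"[a-z']+", text) if w not in STOP_WORDS]


def ngram_counts(tweets, n=1):
    """Count n-grams (joined by a space) over all tweets.

    N-grams are built after stop-word removal, so two words separated
    only by a stop word in the original tweet become adjacent.
    """
    counts = Counter()
    for tweet in tweets:
        tokens = tokenize(tweet)
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts


# Made-up sample tweets on the flood topic (not data from the paper).
SAMPLE_TWEETS = [
    "Flood warning issued for the river valley https://example.com/a",
    "@user Flood warning in the city centre, roads closed",
    "Roads closed after heavy rain, flood warning still active",
]

if __name__ == "__main__":
    # The most frequent unigrams and bigrams are what a word cloud would emphasise.
    print(ngram_counts(SAMPLE_TWEETS, n=1).most_common(3))
    print(ngram_counts(SAMPLE_TWEETS, n=2).most_common(3))
```

In a distributed setting the same counting would typically be expressed as a map step (tokenize and emit n-grams) followed by a reduce step (sum counts per n-gram). The sketch keeps everything in one process so that it stays self-contained.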

Index Terms

Computer systems organization
  Architectures
    Other architectures
      Data flow architectures

Published In

CCIOT '19: Proceedings of the 2019 4th International Conference on Cloud Computing and Internet of Things

September 2019

134 pages

ISBN:9781450372411

DOI:10.1145/3361821

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

Waseda University

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

CCIOT 2019: 2019 4th International Conference on Cloud Computing and Internet of Things

September 20 - 22, 2019

Tokyo, Japan

Article Metrics

Total Citations: 3
Total Downloads: 315
Downloads (Last 12 months): 40
Downloads (Last 6 weeks): 11

Reflects downloads up to 07 Mar 2025

Cited By

Mysiuk I, Mysiuk R, Shuvar R, Yuzevych V, Pavlenchyk A, Dalyk V (2024) Designing a Data Pipeline Architecture for Intelligent Analysis of Streaming Data. Science, Engineering Management and Information Technology, 361-372. DOI: 10.1007/978-3-031-72284-4_22. Online publication date: 12-Sep-2024
https://doi.org/10.1007/978-3-031-72284-4_22

Gutierrez C, Whittaker A, Patenio K, Gehman J, Lefsrud L, Barbosa D, Stroulia E, Onuţ I, Zulkernine F (2021) Analyzing and visualizing Twitter conversations. Proceedings of the 31st Annual International Conference on Computer Science and Software Engineering, 4-13. DOI: 10.5555/3507788.3507791. Online publication date: 22-Nov-2021
https://dl.acm.org/doi/10.5555/3507788.3507791

Khan M, Yu W (2021) ROBOTune: High-Dimensional Configuration Tuning for Cluster-Based Data Analytics. Proceedings of the 50th International Conference on Parallel Processing, 1-10. DOI: 10.1145/3472456.3472518. Online publication date: 9-Aug-2021
https://dl.acm.org/doi/10.1145/3472456.3472518

Al-Obeidat F, Bani-Hani A, Adedugbe O, Majdalawieh M, Benkhelifa E (2021) A microservices persistence technique for cloud-based online social data analysis. Cluster Computing 24:3, 2341-2353. DOI: 10.1007/s10586-021-03244-0. Online publication date: 1-Sep-2021
https://dl.acm.org/doi/10.1007/s10586-021-03244-0
