research-article

Social Media Data Processing Infrastructure by Using Apache Spark Big Data Platform: Twitter Data Analysis

Published: 20 September 2019

ABSTRACT

Social media provide continuous data streams that contain information with different levels of sensitivity, validity and accuracy. Therefore, this type of information has to be properly filtered, extracted and processed to avoid noisy and inaccurate results. The main goal of this work is to propose an architecture and workflow able to process Twitter social network data in near real-time. The primary design of the introduced architecture covers all processing aspects, from data ingestion and storage to data processing and analysis. This paper presents an Apache Spark and Hadoop implementation. The secondary objective is to analyse tweets on a defined topic: floods. The word frequency method (Word Clouds) is shown as a major tool to analyse the content of the input dataset. The experimental architecture confirmed the usefulness of many well-known functions of Spark and Hadoop in the social data domain. The platforms used provided effective tools for optimal data ingestion and storage as well as processing and analysis. Based on the analytical part, it was observed that the word frequency method (n-grams) can effectively reveal the tweets' content. According to the results of this study, the tweets proved their high informative potential regarding data quality and quantity.

    Reviews

    Dominik Strzalka

    What would we do without social media? What would the world look like if there weren't continuous data streams? If we refer back to our history, the first big breakthrough was around 1440, when Johannes Gutenberg introduced his printing technology. This was the beginning of information spreading around the world. At first, this was done very slowly, but then in the early 17th century (from 1605), a new and interesting idea appeared: the newspaper. Information could now spread more quickly. And when Gauss, Weber, Wheatstone, and Morse introduced their various telegraphs, it was clear that information spreads even faster with a pair of wires. After the discoveries of Tesla and Marconi, with their new invention called the radio, the speed of information spreading reached almost the speed of light. However, this communication was rather one-directional, and its flow was very limited: the reader/listener had little or no opportunity to create, respond to, and quickly spread news. For example, citizens band (CB) radio never became very popular. Still, all of these inventions were only a prelude to what we have today. An increase in information flow intensity has been observed since the 1970s, when it became clear that the idea of a global market was not a dream but a reality. Starting from Sydney, through Tokyo, Bombay, Frankfurt, Paris, and London, and reaching the New York Stock Exchange, the markets worked almost the whole day with constant data flow about changes in share prices. It was only a matter of time before this situation became the norm, though in different dimensions. This is possible thanks to the Internet: one of its applications, social media, has taken over the world, generating flows of information. Different social media services generate data streams of information with different levels of sensitivity, validity, and accuracy. The main contribution of this paper is an architecture that is able to process Twitter's data streams.
    The authors propose a five-component system: (1) data ingestion based on Apache Flume; (2) data storage on the Hadoop Distributed File System (HDFS), where tweets are broken into separate blocks and distributed to nodes; (3) a data warehouse, Apache Hive with HiveQL, to store data in the form of a table for further analysis; (4) a resource manager for job scheduling, Yet Another Resource Negotiator (YARN); and (5) the Spark processing engine. Data from Twitter is easily available through application programming interface (API) access. As an experiment, the authors apply the word frequency method (n-grams) to two datasets: 1,000 tweets with the keyword "flood" (completed on April 10, 2018) and 10,000 tweets with the keyword "flood" (completed on April 25, 2018). The proposed architecture works very well to uncover the content ... in the tweets. It should be noted that processing social media data is not trivial; the paper is a novel attempt to show how the Twitter data stream can be processed by the Apache Spark big data platform.
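
    The word frequency (n-gram) analysis mentioned in the review can be illustrated with a minimal sketch. It is plain single-machine Python, not Spark, and it is not the authors' implementation: the function names, the tokenization rule, and the sample tweets are all assumptions made for illustration. It only shows the counting step that such an analysis would run over a set of tweets.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical tokenization rule: runs of lowercase letters, digits, apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text: str) -> List[str]:
    """Lowercase a tweet, drop URLs, and split it into word tokens."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return TOKEN_RE.findall(text)


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequencies(tweets: Iterable[str], n: int = 1) -> Counter:
    """Count how often each n-gram occurs across all tweets."""
    counts: Counter = Counter()
    for tweet in tweets:
        counts.update(ngrams(tokenize(tweet), n))
    return counts


# Invented sample data, not taken from the paper's datasets.
SAMPLE_TWEETS = [
    "Flood warning issued for the river valley",
    "River valley roads closed after flood warning http://example.com/x",
    "Stay safe, flood warning still active",
]

if __name__ == "__main__":
    # Print the most frequent bigrams in the sample.
    print(ngram_frequencies(SAMPLE_TWEETS, n=2).most_common(3))
```

    In a distributed setting the same count would typically be expressed as a map over tweets followed by a per-key reduction, which is the kind of job a Spark engine schedules across nodes; the sketch keeps everything in one process for clarity.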

    • Published in

      CCIOT '19: Proceedings of the 2019 4th International Conference on Cloud Computing and Internet of Things
      September 2019
      134 pages
      ISBN: 9781450372411
      DOI: 10.1145/3361821

      Copyright © 2019 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • research-article
      • Research
      • Refereed limited
