Towards a real-time processing framework based on improved distributed recurrent neural network variants with fastText for social big data analytics

https://doi.org/10.1016/j.ipm.2019.102122

Highlights

  • We present an effective distributed intelligent system for real-time social big data analytics, which is designed to ingest, store, process, index, and visualize huge amounts of information.

  • The system takes advantage of distributed machine learning and deep learning techniques for enhancing decision-making processes in the context of big data.

  • We propose an efficient strategy based on FastText word embedding and Recurrent neural network variants to learn textual data representations efficiently.

  • We devise a solution to improve the performance of the well-known recurrent neural network models LSTM, BiLSTM, and GRU for sentiment analysis.

  • The experimental results prove the effectiveness of our proposal.

Abstract

Big data generated by social media is a valuable source of information and offers an excellent opportunity to mine useful insights. In particular, user-generated content such as reviews, recommendations, and users’ behavior data is useful for supporting the marketing activities of many companies. Knowing what users are saying about the products they bought or the services they used through reviews in social media is a key factor in making decisions. Sentiment analysis is one of the fundamental tasks in Natural Language Processing. Although deep learning for sentiment analysis has achieved great success and allowed several firms to analyze and extract relevant information from their textual data, a model that runs in a traditional environment cannot remain effective as the volume of data grows, which underlines the importance of efficient distributed deep learning models for social Big Data analytics. Besides, social media analysis is known to be a complex process that involves a set of complex tasks. Therefore, it is important to address the challenges and issues of social big data analytics and to enhance the performance of deep learning techniques in terms of classification accuracy to obtain better decisions.

In this paper, we propose an approach for sentiment analysis that combines fastText with Recurrent Neural Network (RNN) variants to represent textual data efficiently. It then employs the new representations to perform the classification task. Its main objective is to enhance the performance of well-known RNN variants in terms of classification accuracy and to handle large-scale data. In addition, we propose a distributed intelligent system for real-time social big data analytics. It is designed to ingest, store, process, index, and visualize huge amounts of information in real-time. The proposed system adopts distributed machine learning with our proposed method for enhancing decision-making processes. Extensive experiments conducted on two benchmark data sets demonstrate that our proposal for sentiment analysis outperforms well-known distributed recurrent neural network variants (i.e., Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BiLSTM), and Gated Recurrent Unit (GRU)). Specifically, we tested the efficiency of our approach using the three different deep learning models. The results show that our proposed approach is able to enhance the performance of all three models. The current work can provide several benefits for researchers and practitioners who want to collect, handle, analyze, and visualize several sources of information in real-time. It can also contribute to a better understanding of public opinion and user behaviors using our proposed system with the improved variants of the most powerful distributed deep learning and machine learning algorithms. Furthermore, it is able to increase the classification accuracy of several existing works based on RNN models for sentiment analysis.

Introduction

In the era of Big Data, the world’s largest technology organizations like Microsoft, Amazon, and Google have collected massive amounts of data estimated at the size of exabytes or larger. Social media companies like Twitter, Facebook, and YouTube have billions of users. Therefore, many businesses are employing social media to keep in touch with their clients, and promote the services and products offered. Clients also adopt social media to get information about interesting services or goods (Najafabadi, Villanustre, Khoshgoftaar, Seliya, Wald, Muharemagic, 2015, Salehan, Kim, 2016, Xiang, Schwartz, Gerdes Jr, Uysal, 2015).

The tremendous growth of social media and of user-generated data provides an excellent opportunity to mine valuable insights and better understand users’ behaviors. This has motivated the development of big data solutions to solve many real-life issues (Najafabadi, Villanustre, Khoshgoftaar, Seliya, Wald, Muharemagic, 2015, Salehan, Kim, 2016, Xiang, Schwartz, Gerdes Jr, Uysal, 2015).

Generally, the term Big Data stands for data sets whose volume exceeds the capabilities of conventional tools to capture, manage, analyze and store data effectively (Bello-Orgaz, Jung, Camacho, 2016, García-Gil, Luengo, García, Herrera, 2019). The concept of Big data is characterized by the 5Vs (i.e., volume, velocity, variety, veracity, value).

  • Volume: This characteristic refers to the massive volumes of data generated every second. Finding valuable insights through exploration and analysis creates serious problems for traditional tools. For instance, Flickr generates nearly 3.6 TB of data and Google handles approximately 20,000 TB each day (Zhang, Yang, Chen, & Li, 2018). Furthermore, as stated by McAfee, Brynjolfsson, Davenport, Patil, and Barton (2012), approximately 2.5 exabytes of data were generated each day in 2012, and this amount doubles every 40 months. According to (García-Gil, Luengo, García, Herrera, 2019, García-Gil, Ramírez-Gallego, García, Herrera, 2017), it is estimated that by 2020 the digital world will reach 44 zettabytes (i.e., 44 trillion gigabytes), which is 10 times larger than the 4.4 zettabytes of 2013.

  • Velocity: The speed at which data are generated and should be processed. In particular, with the proliferation of digital devices such as sensors and smartphones, data generation has grown at an unprecedented rate, which poses serious challenges for handling streaming data and performing real-time analytics (Bello-Orgaz, Jung, Camacho, 2016, Gandomi, Haider, 2015).

  • Variety: It indicates the various types of data that may be available in a structured, semi-structured, or unstructured format. Structured data, of which relational databases are a typical example, constitutes the smallest percentage of all existing data. Semi-structured data, such as XML (Extensible Markup Language), does not conform to strict standards. The third type is unstructured data, which represents more than 75% of big data. It includes audio, text, video, and images (Bello-Orgaz, Jung, Camacho, 2016, Chen, Mao, Liu, 2014, Gandomi, Haider, 2015, Zhang, Yang, Chen, Li, 2018).

  • Veracity: This characteristic was coined by IBM as the fourth V. It stands for the correctness and accuracy of data. In particular, with many forms of big data, quality and correctness are less controllable. For instance, the sentiments of users in social media are inherently uncertain because they involve human judgment. Nevertheless, they contain valuable information. Thus, the necessity to deal with uncertain and imprecise data is another aspect of Big Data, which should be addressed using the appropriate tools (Bello-Orgaz, Jung, Camacho, 2016, Gandomi, Haider, 2015, García-Gil, Luengo, García, Herrera, 2019).

  • Value: Oracle added the so-called value as the fifth characteristic of Big data. At a simplistic level, big data has no intrinsic value. It becomes valuable only when we are able to derive the insights required to meet a particular need or address a problem. In other words, value is a key feature for any big data application as it allows generating useful business information (Bello-Orgaz, Jung, Camacho, 2016, Lee, 2017).

In the era of Big data, several companies around the world are using various solutions and techniques to analyze their huge amount of data in order to get meaningful insights. These techniques are called Big Data Analytics. In particular, the term big data analytics encompasses several algorithms, advanced statistics and applied analytics, which are used for various purposes such as prediction, classification, decision-making, and so on (Jimenez-Marquez, Gonzalez-Carrasco, Lopez-Cuadrado, Ruiz-Mezcua, 2019, Najafabadi, Villanustre, Khoshgoftaar, Seliya, Wald, Muharemagic, 2015, Saggi, Jain, 2018, Wang, Kung, Byrd, 2018a).

More particularly, in addition to analyzing a large amount of data, Big Data Analytics creates serious challenges for machine learning techniques and data analysis tasks, including noisy data, highly distributed input data sources, high dimensionality, and limited labeled data. Furthermore, there are other real problems in Big Data Analytics such as data indexing, data storage, and information retrieval. As a result, more sophisticated data analysis and data management tools are necessary to handle the massive amount of data and deal with various real-world problems in the context of Big Data (Jimenez-Marquez, Gonzalez-Carrasco, Lopez-Cuadrado, Ruiz-Mezcua, 2019, Najafabadi, Villanustre, Khoshgoftaar, Seliya, Wald, Muharemagic, 2015).

Due to the rapid development of social media, the huge amount of reviews generated has caught the attention of several organizations, governments, businesses, and politicians, who want to know the public opinion on various issues and understand users’ behaviors for specific purposes. Moreover, many users follow the opinions and feedback of others to judge the quality of a product or service before purchase. Therefore, sentiment analysis plays an essential role in decision-making tasks. It is a practical technique that is commonly used to deduce the sentiment polarity from people’s opinions (Del Vecchio, Mele, Ndou, Secundo, 2018, Jianqiang, Xiaolin, 2017, Ngai, Tao, Moon, 2015, Pham, Le, 2018, Ragini, Anand, Bhaskar, 2018, Rezaeinia, Rahmani, Ghodsi, Veisi, 2019, Stieglitz, Mirbabaie, Ross, Neuberger, 2018, Valdivia, Luzón, Herrera, 2017).

Deep learning (DL) is an extremely active research area in the machine learning (ML) community. It encompasses a set of learning algorithms that are intended to automatically learn hierarchical representations and extract high-level features based on deep architectures. In particular, deep learning models have achieved remarkable success in various natural language processing tasks, including text classification and sentiment analysis (Chen, Mao, Liu, 2014, Rezaeinia, Rahmani, Ghodsi, Veisi, 2019, Zhang, Yang, Chen, 2016).

Due to the different characteristics of big data, the design of an efficient big data system based on deep learning requires the consideration of many issues. In particular, because of the volume and the variety of big data sources, it is difficult to effectively integrate data collected from various distributed data sources. For example, more than 175 million tweets (i.e., unstructured data, which include images, videos, text, and so on) are posted by millions of user accounts around the world. In addition, it is necessary to store and handle the collected heterogeneous data efficiently. For instance, Facebook needs to manage, store, and analyze more than 30 petabytes of data. Moreover, in order to take advantage of big data analytics, there is a need to analyze big data based on real-time, near real-time, or batch processing. As a result, enhancing the performance of techniques used for a variety of tasks, such as classification, prediction, and visualization, is crucial for improving decision-making processes. Thus, many companies can benefit from the advantages of artificial intelligence with big data, increasing revenue by strengthening customer relationships (Hu, Wen, Chua, Li, 2014, Jimenez-Marquez, Gonzalez-Carrasco, Lopez-Cuadrado, Ruiz-Mezcua, 2019).

In this work, instead of directly applying DL models to real-world problems, we aim to improve the performance of the most powerful deep learning models for social big data analytics. In general, the concept of social big data analytics can be deemed the intersection between big data analytics and social media. Its main objective is to take advantage of the efforts of the two fields for analyzing and extracting relevant knowledge from the huge amount of social media data. Real-time social big data analytics, in turn, aims to manage the complexity of conducting social big data analytics in real-time and to process unbounded data streams, which is the goal of our proposed framework.

The contributions of this paper are summarized as follows:

  • Instead of using deep learning models to classify the reviews directly, we proposed a procedure based on fastText word embedding and Recurrent neural network variants for sentiment analysis, which is devoted to representing textual data efficiently.

  • We designed an efficient strategy based on machine learning, which is tailored to improve the performance in terms of classification accuracy of three distributed Recurrent neural network variants, namely, Distributed Long Short-Term Memory (LSTM), Distributed Bidirectional Long Short-Term Memory (BiLSTM) and Distributed Gated Recurrent Unit (GRU) models.

  • We proposed a distributed intelligent system for real-time social big data analytics. It is based on a set of steps dedicated not only to ingesting, storing, processing, indexing, and visualizing a large amount of information in real-time, but also to applying a set of distributed machine learning and deep learning techniques for effective classification, prediction, and real-time analysis of customer behavior and public opinion.

  • We conducted a set of experiments using two real-world data sets. The experimental results demonstrate that our proposal yields better classification accuracy than existing state-of-the-art methods. Moreover, it is able to improve the performance of several existing works.

The rest of the paper is structured as follows. The next section outlines the related works. Section 3 describes Recurrent neural network (RNN) variants and fastText for sentiment analysis. Section 4 details our proposal. Section 5 describes the experiments. Finally, Section 6 concludes this paper.

Section snippets

Related work

With the advent of social networks, big data analytics played an important role in decision-making processes. Due to the remarkable success of Deep Learning (DL) approaches, various solutions have been proposed to cope with the challenges related to Natural language processing tasks.

In particular, several deep learning models for natural language processing have been designed based on employing word vector representations. For instance, Mikolov, Sutskever, Chen, Corrado, and Dean (2013)

Recurrent neural network variants with fastText for sentiment analysis

This section presents the word embedding technique called fastText, with Recurrent neural network, Long short-term memory, Bidirectional long short-term memory and Gated recurrent unit methods for Sentiment analysis.
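
As background for the fastText component, the following is a minimal sketch of the subword idea described by Bojanowski et al. (2017), cited in the reference list: a word is wrapped in the boundary markers < and > and broken into character n-grams, and its vector is built from the vectors of those n-grams. The function name char_ngrams and its parameters are hypothetical names chosen for this illustration. They are not the fastText API, and the snippet above does not state which n-gram lengths the authors used.

```python
from typing import List


def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> List[str]:
    """Return the character n-grams of a word, fastText-style.

    Illustrative helper (hypothetical name, not part of the fastText library).
    The word is wrapped in '<' and '>' so that prefixes and suffixes are
    distinguishable from n-grams in the middle of the word.
    """
    token = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the wrapped token.
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
    return grams


# Example from Bojanowski et al. (2017): the 3-grams of "where".
print(char_ngrams("where", 3, 3))
# prints ['<wh', 'whe', 'her', 'ere', 're>']
```

In the fastText model the whole wrapped word is kept as one additional unit, and the word representation is the sum of the vectors of all its units, which is what lets the model produce vectors for misspelled or unseen words that are common in social media text.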

Our proposal

In this section, we present a distributed intelligent system for social big data analytics, and our proposal for improving distributed RNN variants. The key idea is to perform real-time analysis, improve the classification performance of the most powerful RNN models, and handle large-scale data sets using parallel and distributed training.
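
The snippet above does not say how the parallel and distributed training is organized. Purely as an assumption, one common scheme is synchronous data-parallel training: each worker computes a gradient on its own shard of the mini-batch, the gradients are averaged, and every worker applies the same update. The toy sketch below illustrates that scheme on a one-parameter linear model. All names (shard_gradient, data_parallel_step) and the data are hypothetical, and the workers are simulated sequentially rather than run on a cluster.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def shard_gradient(w: float, shard: List[Sample]) -> float:
    """Mean gradient of the loss 0.5 * (w * x - y) ** 2 over one shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w: float, shards: List[List[Sample]], lr: float) -> float:
    """One synchronous update: average the per-worker gradients, then step.

    Assumption: shards have equal size, so the average of the shard
    gradients equals the gradient over the full mini-batch.
    """
    # On a real cluster each gradient would be computed on a separate worker.
    grads = [shard_gradient(w, shard) for shard in shards]
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad


# Toy data generated by y = 2 * x, split evenly across two simulated workers.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]

w = 0.0
for _ in range(50):
    w = data_parallel_step(w, shards, lr=0.05)
print(round(w, 6))
```

With equal-sized shards the averaged update is identical to a single-machine update on the whole mini-batch, so this form of parallelism changes how the work is divided rather than what is computed in each step.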

Data sets

In this work, the experiments are conducted using two real-world data sets, namely, Yelp and Sentiment140.

  • Yelp: This data set is composed of 6,685,900 classified reviews provided by 1,637,138 users for 192,609 businesses. In this work, we have randomly selected 100,000 reviews as the original data set. We have considered 1 and 2 stars as the negative class, and 4 and 5 stars as the positive class (Yelp, 2019).

  • Sentiment140: This is a Twitter sentiment analysis data set, which is originated from Stanford
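
The Yelp labeling rule described above (1 and 2 stars as negative, 4 and 5 stars as positive) can be sketched as follows. The function name star_to_label and the 0/1 label encoding are hypothetical choices for this illustration. The text does not say what happens to 3-star reviews, so discarding them here is an assumption.

```python
from typing import List, Optional, Tuple


def star_to_label(stars: int) -> Optional[int]:
    """Map a Yelp star rating to a binary sentiment label.

    Returns 0 (negative) for 1-2 stars and 1 (positive) for 4-5 stars.
    Assumption: 3-star reviews are treated as neutral and return None.
    """
    if stars in (1, 2):
        return 0
    if stars in (4, 5):
        return 1
    return None


def label_reviews(reviews: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Attach labels to (text, stars) pairs and drop unlabeled reviews."""
    labeled = []
    for text, stars in reviews:
        label = star_to_label(stars)
        if label is not None:
            labeled.append((text, label))
    return labeled


# Made-up example reviews, for illustration only.
sample = [("Great food", 5), ("It was okay", 3), ("Never again", 1)]
print(label_reviews(sample))
# prints [('Great food', 1), ('Never again', 0)]
```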

Conclusions

With the advent of social media applications, social big data has become an important topic for a large number of research areas. In this paper, we have proposed a distributed intelligent system for real-time social big data analytics. It is primarily designed based on a set of powerful big data tools to manage, process, and analyze large-scale data efficiently, as well as to improve real-time decision-making processes. In addition, we have devised an effective solution to improve the

Acknowledgments

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

References (81)

  • J.L. Jimenez-Marquez et al. Towards a big data framework for analyzing social media content. International Journal of Information Management (2019)
  • I. Lee. Big data: Dimensions, evolution, impacts, and challenges. Business Horizons (2017)
  • G. Liu et al. Bidirectional LSTM with attention mechanism and convolutional layer for text classification. Neurocomputing (2019)
  • E.W. Ngai et al. Social media research: Theories, constructs, and conceptual frameworks. International Journal of Information Management (2015)
  • D.-H. Pham et al. Exploiting multiple word embeddings and one-hot character vectors for aspect-based sentiment analysis. International Journal of Approximate Reasoning (2018)
  • J.R. Ragini et al. Big data analytics for disaster response and recovery through sentiment analysis. International Journal of Information Management (2018)
  • S.M. Rezaeinia et al. Sentiment analysis based on improved pre-trained word embeddings. Expert Systems with Applications (2019)
  • M.K. Saggi et al. A survey towards an integration of big data analytics to big insights for value-creation. Information Processing & Management (2018)
  • M. Salehan et al. Predicting the performance of online consumer reviews: A sentiment mining approach to big data analytics. Decision Support Systems (2016)
  • M. Song et al. Attention-based long short-term memory network using sentiment lexicon embedding for aspect-level sentiment analysis in Korean. Information Processing & Management (2019)
  • S. Stieglitz et al. Social media analytics–challenges in topic discovery, data collection, and data preparation. International Journal of Information Management (2018)
  • Y. Wang et al. Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations. Technological Forecasting and Social Change (2018)
  • Z. Xiang et al. What can big data and text analytics tell us about hotel guest experience and satisfaction? International Journal of Hospitality Management (2015)
  • BigDL (2019). Distributed deep learning library for Apache Spark. (Accessed: 10 April 2019) URL:...
  • P. Bojanowski et al. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics (2017)
  • Cassandra (2019). Apache Cassandra. (Accessed: 10 April 2019) URL:...
  • G.C. Cawley et al. Sparse multinomial logistic regression via Bayesian L1 regularisation. Advances in Neural Information Processing Systems (2007)
  • M. Chen et al. Big data: A survey. Mobile Networks and Applications (2014)
  • T. Chen et al. XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016)
  • K. Cho et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014)
  • J. Chung et al. Empirical evaluation of gated recurrent neural networks on sequence modeling. NIPS 2014 Workshop on Deep Learning, December 2014 (2014)
  • P. Del Vecchio et al. Creating value from social big data: Implications for smart tourism destinations. Information Processing & Management (2018)
  • Elasticsearch (2019). Elasticsearch. (Accessed: 10 April 2019) URL:...
  • FastText (2019). Fasttext: Library for efficient text classification and representation learning. (Accessed: 10 April...
  • D. García-Gil et al. A comparison on scalability for batch big data processing on Apache Spark and Apache Flink. Big Data Analytics (2017)
  • C. Gormley et al. Elasticsearch: The definitive guide: A distributed real-time search and analytics engine (2015)
  • A. Graves et al. Speech recognition with deep recurrent neural networks. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013)
  • K. Greff et al. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems (2017)
  • Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks...
  • S. Hochreiter et al. Long short-term memory. Neural Computation (1997)