Towards a real-time processing framework based on improved distributed recurrent neural network variants with fastText for social big data analytics
Introduction
In the era of Big Data, the world’s largest technology organizations like Microsoft, Amazon, and Google have collected massive amounts of data estimated at the size of exabytes or larger. Social media companies like Twitter, Facebook, and YouTube have billions of users. Therefore, many businesses are employing social media to keep in touch with their clients, and promote the services and products offered. Clients also adopt social media to get information about interesting services or goods (Najafabadi, Villanustre, Khoshgoftaar, Seliya, Wald, Muharemagic, 2015, Salehan, Kim, 2016, Xiang, Schwartz, Gerdes Jr, Uysal, 2015).
The tremendous growth of social media and of user-generated data provides an excellent opportunity to mine valuable insights and better understand users’ behaviors. This has motivated the development of big data solutions to solve many real-life issues (Najafabadi, Villanustre, Khoshgoftaar, Seliya, Wald, Muharemagic, 2015, Salehan, Kim, 2016, Xiang, Schwartz, Gerdes Jr, Uysal, 2015).
Generally, the term Big Data refers to data sets whose volume exceeds the capabilities of conventional tools to capture, manage, analyze and store data effectively (Bello-Orgaz, Jung, Camacho, 2016, García-Gil, Luengo, García, Herrera, 2019). The concept of Big Data is characterized by the 5Vs (i.e., volume, velocity, variety, veracity, value).
- Volume: This characteristic refers to the massive volumes of data generated every second. Finding valuable insights through exploration and analysis creates serious problems for traditional tools. For instance, Flickr generates nearly 3.6 TB of data and Google handles approximately 20,000 TB each day (Zhang, Yang, Chen, & Li, 2018). Furthermore, as stated by McAfee, Brynjolfsson, Davenport, Patil, and Barton (2012), approximately 2.5 exabytes of data were generated each day in 2012. This amount of data doubles every 40 months. According to (García-Gil, Luengo, García, Herrera, 2019, García-Gil, Ramírez-Gallego, García, Herrera, 2017), it is estimated that by 2020, the digital world will reach 44 zettabytes (i.e., 44 trillion gigabytes), which is 10 times larger than the 4.4 zettabytes of 2013.
- Velocity: This characteristic refers to the speed at which data are generated and must be processed. In particular, with the proliferation of digital devices such as sensors and smartphones, data are being generated at an unprecedented rate, which poses serious challenges for handling streaming data and performing real-time analytics (Bello-Orgaz, Jung, Camacho, 2016, Gandomi, Haider, 2015).
- Variety: This characteristic indicates the various types of data, which may be available in a structured, semi-structured, or unstructured format. Structured data, of which relational databases are a typical example, constitutes the smallest percentage of all existing data. Semi-structured data, such as XML (Extensible Markup Language) documents, does not conform to strict standards. The third type is unstructured data, which represents more than 75% of big data and includes audio, text, video, and images (Bello-Orgaz, Jung, Camacho, 2016, Chen, Mao, Liu, 2014, Gandomi, Haider, 2015, Zhang, Yang, Chen, Li, 2018).
- Veracity: This characteristic was coined by IBM as the fourth V. It stands for the correctness and accuracy of data. In particular, with many forms of big data, quality and correctness are less controllable. For instance, the sentiments of users in social media are inherently uncertain because they involve human judgment. Nevertheless, they contain valuable information. Thus, the necessity to deal with uncertain and imprecise data is another aspect of Big Data, which should be addressed using the appropriate tools (Bello-Orgaz, Jung, Camacho, 2016, Gandomi, Haider, 2015, García-Gil, Luengo, García, Herrera, 2019).
- Value: Oracle added the so-called value as the fifth characteristic of Big Data. At a simplistic level, big data has no intrinsic value. It becomes valuable only when we are able to derive the insights required to meet a particular need or address a problem. In other words, value is a key feature for any big data application as it allows generating useful business information (Bello-Orgaz, Jung, Camacho, 2016, Lee, 2017).
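As a quick check of the volume figures quoted above, the following minimal Python sketch converts 44 zettabytes to gigabytes and computes the growth factor relative to 2013. It assumes decimal (SI) prefixes. The implied doubling period is only an illustrative derivation under an assumed steady exponential growth; it is not a figure reported in the cited works, and it concerns the total size of the digital world rather than the daily generation rate.

```python
import math

ZETTABYTE = 10**21  # bytes, decimal (SI) prefix (assumption)
GIGABYTE = 10**9    # bytes, decimal (SI) prefix (assumption)

size_2013_zb = 4.4  # zettabytes in 2013, as quoted in the text
size_2020_zb = 44   # zettabytes estimated for 2020, as quoted in the text

# 44 ZB expressed in gigabytes: 44 * 10^12, i.e., 44 trillion GB.
size_2020_gb = size_2020_zb * ZETTABYTE // GIGABYTE

# Growth factor between 2013 and 2020 (the text states "10 times").
growth_factor = size_2020_zb / size_2013_zb

# Doubling period implied by this growth over 84 months,
# assuming steady exponential growth (illustrative only).
months_elapsed = (2020 - 2013) * 12
implied_doubling_months = months_elapsed / math.log2(growth_factor)

print(f"{size_2020_gb:,} GB")
print(f"growth factor: {growth_factor:.1f}")
print(f"implied doubling period: {implied_doubling_months:.1f} months")
```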
In the era of Big Data, companies around the world use various solutions and techniques to analyze their huge amounts of data and obtain meaningful insights. These techniques are called Big Data Analytics. In particular, the term encompasses several algorithms, advanced statistics, and applied analytics, which are used for purposes such as prediction, classification, and decision-making (Jimenez-Marquez, Gonzalez-Carrasco, Lopez-Cuadrado, Ruiz-Mezcua, 2019, Najafabadi, Villanustre, Khoshgoftaar, Seliya, Wald, Muharemagic, 2015, Saggi, Jain, 2018, Wang, Kung, Byrd, 2018a).
Beyond the sheer amount of data to analyze, Big Data Analytics poses serious challenges for machine learning techniques and data analysis tasks, including noisy data, highly distributed input data sources, high dimensionality, and limited labeled data. Furthermore, there are other real problems in Big Data Analytics such as data indexing, data storage, and information retrieval. As a result, more sophisticated data analysis and data management tools are necessary to handle the massive amount of data and deal with various real-world problems in the context of Big Data (Jimenez-Marquez, Gonzalez-Carrasco, Lopez-Cuadrado, Ruiz-Mezcua, 2019, Najafabadi, Villanustre, Khoshgoftaar, Seliya, Wald, Muharemagic, 2015).
Due to the rapid development of social media, the huge number of reviews generated has caught the attention of organizations, governments, businesses, and politicians who want to know public opinion on various issues and understand users’ behaviors for specific purposes. Moreover, many users follow the opinions and feedback of others to judge the quality of a product or service before purchase. Therefore, sentiment analysis plays an essential role in decision-making tasks. It is a practical technique commonly used to deduce the sentiment polarity of people’s opinions (Del Vecchio, Mele, Ndou, Secundo, 2018, Jianqiang, Xiaolin, 2017, Ngai, Tao, Moon, 2015, Pham, Le, 2018, Ragini, Anand, Bhaskar, 2018, Rezaeinia, Rahmani, Ghodsi, Veisi, 2019, Stieglitz, Mirbabaie, Ross, Neuberger, 2018, Valdivia, Luzón, Herrera, 2017).
Deep learning (DL) is an extremely active research area in the machine learning (ML) community. It encompasses a set of learning algorithms intended to automatically learn hierarchical representations and extract high-level features using deep architectures. In particular, deep learning models have achieved remarkable success in various natural language processing tasks, including text classification and sentiment analysis (Chen, Mao, Liu, 2014, Rezaeinia, Rahmani, Ghodsi, Veisi, 2019, Zhang, Yang, Chen, 2016).
Due to the different characteristics of big data, the design of an efficient big data system based on deep learning requires the consideration of many issues. In particular, because of the volume and the variety of big data sources, it is difficult to effectively integrate data collected from various distributed data sources. For example, more than 175 million tweets (i.e., unstructured data, which include images, videos, text, and so on) are posted by millions of user accounts worldwide. In addition, it is necessary to store and handle the collected heterogeneous data efficiently. For instance, Facebook needs to manage, store and analyze more than 30 petabytes of data. Moreover, in order to take advantage of big data analytics, there is a need to analyze big data using real-time, near real-time or batch processing. As a result, enhancing the performance of techniques used for a variety of tasks like classification, prediction, and visualization is crucial for improving decision-making processes. Thus, many companies can benefit from combining artificial intelligence with big data, increasing revenue by strengthening customer relationships (Hu, Wen, Chua, Li, 2014, Jimenez-Marquez, Gonzalez-Carrasco, Lopez-Cuadrado, Ruiz-Mezcua, 2019).
In this work, instead of directly applying DL models to real-world problems, we aim to improve the performance of the most powerful deep learning models for social big data analytics. In general, social big data analytics can be viewed as the intersection of big data analytics and social media. Its main objective is to combine the efforts of the two fields to analyze and extract relevant knowledge from the huge amount of social media data. Real-time social big data analytics, in turn, aims to manage the complexity of conducting these analyses in real time over unbounded data streams, which is the goal of our proposed framework.
The contributions of this paper are summarized as follows:
- Instead of using deep learning models to classify the reviews directly, we proposed a procedure for sentiment analysis based on fastText word embedding and recurrent neural network variants, which is devoted to representing textual data efficiently.
- We designed an efficient strategy based on machine learning, which is tailored to improve the classification accuracy of three distributed recurrent neural network variants, namely, Distributed Long Short-Term Memory (LSTM), Distributed Bidirectional Long Short-Term Memory (BiLSTM) and Distributed Gated Recurrent Unit (GRU) models.
- We proposed a distributed intelligent system for real-time social big data analytics. It is based on a set of steps that not only ingest, store, process, index, and visualize a large amount of information in real time, but also adopt a set of distributed machine learning and deep learning techniques for effective classification, prediction, and real-time analysis of customer behavior and public opinion.
- We conducted a set of experiments using two real-world data sets. The experimental results demonstrate that our proposal yields better classification accuracy than existing state-of-the-art methods. Moreover, it is able to improve the performance of several existing works.
The rest of the paper is structured as follows. The next section outlines related work. Section 3 describes Recurrent neural network (RNN) variants and fastText for sentiment analysis. Section 4 details our proposal. Section 5 describes the experiments. Finally, Section 6 concludes this paper.
Related work
With the advent of social networks, big data analytics has come to play an important role in decision-making processes. Due to the remarkable success of Deep Learning (DL) approaches, various solutions have been proposed to cope with the challenges related to natural language processing tasks.
In particular, several deep learning models for natural language processing have been designed based on employing word vector representations. For instance, Mikolov, Sutskever, Chen, Corrado, and Dean (2013)
Recurrent neural network variants with fastText for sentiment analysis
This section presents the word embedding technique called fastText, together with the recurrent neural network, long short-term memory, bidirectional long short-term memory, and gated recurrent unit methods for sentiment analysis.
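To illustrate how fastText represents text, the sketch below extracts the character n-grams of a word, following the published fastText convention of wrapping each word in the boundary symbols < and > and using n-grams of length 3 to 6 by default. The helper names (char_ngrams, subword_units, word_vector) and the toy lookup table are hypothetical and serve only for illustration; the actual library hashes n-grams into a fixed number of buckets, which is omitted here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in '<' and '>'."""
    token = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
    return grams


def subword_units(word, min_n=3, max_n=6):
    """N-grams plus the whole wrapped word as one extra unit."""
    units = char_ngrams(word, min_n, max_n)
    token = f"<{word}>"
    if token not in units:  # avoid a duplicate for very short words
        units.append(token)
    return units


def word_vector(word, unit_vectors, dim):
    """Sum the vectors of the known subword units (toy lookup table).

    Units missing from `unit_vectors` are skipped; this is a
    simplification of the hashing scheme used by the real library.
    """
    total = [0.0] * dim
    for unit in subword_units(word):
        vector = unit_vectors.get(unit)
        if vector is not None:
            total = [a + b for a, b in zip(total, vector)]
    return total


if __name__ == "__main__":
    print(char_ngrams("where", 3, 3))
    toy_table = {"<wh": [1.0, 0.0], "re>": [0.0, 2.0]}  # hypothetical values
    print(word_vector("where", toy_table, dim=2))
```

Because a word is built from its subword units, a misspelled or previously unseen word still shares most of its n-grams with known words, which is useful for noisy social media text.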
Our proposal
In this section, we present a distributed intelligent system for social big data analytics, and our proposal for improving distributed RNN variants. The key idea is to perform real-time analysis, improve the classification performance of the most powerful RNN models, and handle large-scale data sets using parallel and distributed training.
Data sets
In this work, the experiments are conducted using two real-world data sets, namely, Yelp and Sentiment140.
- Yelp: This data set is composed of 6,685,900 classified reviews provided by 1,637,138 users for 192,609 businesses. In this work, we have randomly selected 100,000 reviews as the original data set. We have considered 1 and 2 stars as a negative class, while 4 and 5 as a positive class (Yelp, 2019).
- Sentiment140: This is a Twitter sentiment analysis data set, which originated from Stanford
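The Yelp labeling rule described above can be sketched in Python as follows. This is a minimal illustration, not the authors' preprocessing code: the field names "stars" and "text", the fixed random seed, and the decision to drop 3-star reviews (whose treatment is not stated in the text) are assumptions.

```python
import random


def star_to_label(stars):
    """Map a star rating to a sentiment class as described in the text."""
    if stars in (1, 2):
        return "negative"
    if stars in (4, 5):
        return "positive"
    return None  # 3 stars: handling not stated in the text (assumed dropped)


def sample_and_label(reviews, sample_size, seed=0):
    """Randomly select reviews, then attach a polarity label to each.

    `reviews` is a list of dicts with hypothetical keys "text" and "stars".
    """
    rng = random.Random(seed)  # fixed seed for reproducibility (assumption)
    sample = rng.sample(reviews, min(sample_size, len(reviews)))
    labeled = []
    for review in sample:
        label = star_to_label(review["stars"])
        if label is not None:
            labeled.append((review["text"], label))
    return labeled


if __name__ == "__main__":
    toy_reviews = [
        {"text": "Terrible service.", "stars": 1},
        {"text": "It was fine.", "stars": 3},
        {"text": "Loved it!", "stars": 5},
    ]
    for text, label in sample_and_label(toy_reviews, sample_size=3):
        print(label, text)
```

In the setting described above, sample_size would be 100,000.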
Conclusions
With the advent of social media applications, social big data has become an important topic for a large number of research areas. In this paper, we have proposed a distributed intelligent system for real-time social big data analytics. It is primarily designed based on a set of powerful big data tools to manage, process, and analyze large-scale data efficiently, as well as to improve real-time decision-making processes. In addition, we have devised an effective solution to improve the
Acknowledgments
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
References (81)
- et al. (2019). Deep learning-based sentiment classification of evaluative text based on multi-feature fusion. Information Processing & Management.
- et al. (2018). Apra: An approximate parallel recommendation algorithm for big data. Knowledge-Based Systems.
- et al. (2019). Enhancing aspect-based sentiment analysis of Arabic hotels’ reviews using morphological, syntactic and semantic features. Information Processing & Management.
- et al. (2019). Twitter sentiment analysis with a deep neural network: An enhanced approach using user behavioral information. Cognitive Systems Research.
- et al. (2016). Social big data: Recent achievements and new challenges. Information Fusion.
- et al. (2017). Improving sentiment analysis via sentence type classification using BILSTM-CRF and CNN. Expert Systems with Applications.
- et al. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management.
- et al. (2019). Enabling smart data: Noise filtering in big data classification. Information Sciences.
- et al. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks.
- et al. (2019). Deep learning in big data analytics: A comparative study. Computers & Electrical Engineering.