
1 Motivation and Challenges

Organizations increasingly rely on external data, especially Twitter data, for sentiment analysis in order to understand in realtime how customers perceive their enterprise, products, services, and processes. In particular, a large volume of Twitter data expresses the emotions of consumers. Realtime analysis of Twitter data enables timely decisions and interventions by the organization, such as adapting its offer to consumer expectations. However, realtime analysis of Twitter data is very challenging. The most critical challenges are twofold: (i) unlike classic relational data, Twitter data are unstructured, and (ii) the velocity (speed) of the data is extremely high and unpredictable. For instance, on average, more than 6000 tweets are posted every second. Several sentiment analytics solutions have been proposed in the literature, e.g., [1,2,3,4]. Unfortunately, to the best of our knowledge, these solutions exploit only hashtags, which make up a small fragment of a tweet. In our view, this is not sufficient for a complete analysis because it cannot capture the contexts of tweets. In addition, these solutions are built on a traditional architectural paradigm. Therefore, in this paper, we propose SANA, a service-based solution for realtime sentiment analysis of Twitter data, which takes into account both the context and the content of the tweets.

2 System Overview

SANA has a multi-layered architecture. The components of each layer are briefly described below.

Data Collection and Ingestion Layer: This layer contains two components: a data collector and a data ingestor. The data collector is a client that binds to one or more data source APIs, which give access to remote repositories after an authentication check through their public keys. Once the connection is established, the data collector starts fetching data streams (i.e., tweets) in realtime. The data ingestor consists of two interfaces. The first interface feeds data into the SANA data lake, a distributed Hadoop cluster that resides in the storage layer. The other interface opens a channel to push tweets directly to the data processing components.
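The two ingestor interfaces can be illustrated with a minimal sketch. It is not SANA's implementation: the class `DataIngestor`, the function `fake_collector`, and the in-memory list standing in for the Hadoop-backed data lake are all hypothetical names introduced here for illustration.

```python
from queue import Queue
from typing import Callable, Iterable, Iterator, List


class DataIngestor:
    """Hypothetical ingestor exposing the two interfaces described above."""

    def __init__(self, lake_writer: Callable[[str], None], channel: Queue) -> None:
        self.lake_writer = lake_writer  # interface 1: persist raw tweets in the data lake
        self.channel = channel          # interface 2: push tweets to the processing layer

    def ingest(self, tweets: Iterable[str]) -> int:
        """Send every tweet to both interfaces and return how many were ingested."""
        count = 0
        for tweet in tweets:
            self.lake_writer(tweet)
            self.channel.put(tweet)
            count += 1
        return count


def fake_collector() -> Iterator[str]:
    """Stand-in for the data collector; a real one would read from a streaming API."""
    yield "Loving the new phone!"
    yield "Worst service ever."


lake: List[str] = []        # assumption: a list replaces the distributed data lake
channel: Queue = Queue()    # assumption: an in-process queue replaces the real channel
ingestor = DataIngestor(lake.append, channel)
ingested = ingestor.ingest(fake_collector())
```

Writing each tweet to both sinks keeps the raw data available for later batch analysis while the processing components receive it without waiting for storage.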

Data Processing Layer: The components in this layer perform several tasks. The two main tasks are data analysis and visualization; data distribution and query execution are two additional tasks. The analysis starts with filtering the incoming Twitter data. SANA’s data filter removes unnecessary strings from tweets and keeps the core text required for analysis. It also assigns a unique identifier to each tweet. Then, the text classifier classifies the texts as expressing positive or negative sentiment. We use the multinomial naïve Bayes classifier (a supervised machine learning technique) along with Chi Square (\(\chi ^2\)) feature selection. The classifier is trained on datasets whose texts are labeled as positive or negative sentiment. The Chi Square (\(\chi ^2\)) test checks whether the occurrence of a specific string and the occurrence of a specific class are independent; strings whose occurrence depends strongly on the class are kept as features. The NER (named entity recognition) tagger extracts the contexts of the classified texts by labeling the sequences of context-related strings (e.g., person, location, and organization) in a tweet. After classification, the data distributor sends the results to the local disk, the data lake (Hadoop cluster), and the graph storage. Queries for more comprehensive details of the results are submitted through SANA’s query interface.
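The filtering and classification steps can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not SANA's code: the function names, the six toy training tweets, the filtering rules (dropping URLs, mentions, and punctuation), the choice of k = 4 features, and the use of Laplace smoothing are all assumptions made here. The NER step is not covered.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Set, Tuple

Doc = Tuple[List[str], str]  # (tokens, label)


def filter_tweet(text: str) -> List[str]:
    """Drop URLs and @mentions, then keep lowercase word tokens only."""
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    return re.findall(r"[a-z']+", text)


def chi_square(docs: List[Doc], term: str, label: str) -> float:
    """Chi-square statistic of the 2x2 table (term present?) x (class == label?)."""
    n11 = n10 = n01 = n00 = 0
    for tokens, y in docs:
        has_term = term in tokens
        if has_term and y == label:
            n11 += 1
        elif has_term:
            n10 += 1
        elif y == label:
            n01 += 1
        else:
            n00 += 1
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return n * (n11 * n00 - n10 * n01) ** 2 / denom if denom else 0.0


def select_features(docs: List[Doc], k: int) -> Set[str]:
    """Keep the k terms whose occurrence depends most strongly on the class."""
    vocab = {t for tokens, _ in docs for t in tokens}
    ranked = sorted(vocab, key=lambda t: (-chi_square(docs, t, "pos"), t))
    return set(ranked[:k])


def train_nb(docs: List[Doc], features: Set[str]):
    """Multinomial naive Bayes with Laplace smoothing over the selected features."""
    labels = sorted({y for _, y in docs})
    prior = {y: math.log(sum(1 for _, l in docs if l == y) / len(docs)) for y in labels}
    loglik: Dict[str, Dict[str, float]] = {}
    for y in labels:
        counts = Counter(t for tokens, l in docs if l == y for t in tokens if t in features)
        total = sum(counts.values()) + len(features)
        loglik[y] = {t: math.log((counts[t] + 1) / total) for t in features}
    return prior, loglik


def classify(tokens: List[str], prior, loglik) -> str:
    """Return the label with the highest log posterior; unknown tokens are ignored."""
    scores = {y: prior[y] + sum(loglik[y][t] for t in tokens if t in loglik[y]) for y in prior}
    return max(scores, key=scores.get)


# Toy labeled data (hypothetical), only to exercise the pipeline.
training = [
    ("I love this phone, great battery", "pos"),
    ("Great service and friendly staff", "pos"),
    ("love the camera", "pos"),
    ("Terrible battery, I hate it", "neg"),
    ("Awful service, hate waiting", "neg"),
    ("terrible camera", "neg"),
]
docs = [(filter_tweet(text), label) for text, label in training]
features = select_features(docs, k=4)
prior, loglik = train_nb(docs, features)
prediction = classify(filter_tweet("I love it, great camera! http://t.co/x"), prior, loglik)
```

On this toy data, terms such as "battery" and "service" occur equally often in both classes, so their \(\chi ^2\) score is zero and they are discarded, while class-specific terms such as "love" and "hate" are retained.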

Data Storage Layer: Two types of storage are integrated in SANA, a data lake and a graph storage, in which the results are stored. The data lake is a cluster of nodes across which data blocks are distributed. SANA adopts the data lake to deal with massive-scale data. The graph storage helps build a knowledge graph of the classified texts and their contexts.

Presentation Layer: SANA provides a graphical user interface (GUI) which consists of a control panel and a textbox for data visualization. The control panel provides three services. The data collection service calls and loads the data collector. The backend service calls the processing servers, the graph database server, and the coordination server, which maintains configuration information and provides the distributed synchronization service. The query execution service calls and loads the query processor. Lastly, the visualization interface loads the data visualizer and displays a pie chart that shows the percentages of positive and negative sentiment.

3 Demonstration

SANA is offered both as a desktop-based solution and as software as a service (SaaS) on the cloud. Therefore, it provides two user interfaces: desktop-based and web-based. In this paper, we describe the former. In the first step, a user starts all the servers by clicking the button labeled running background services in the user interface (we assume that these servers are installed and configured on the user’s machine). This starts the data acquisition server, the processing servers, the Hadoop cluster, and the graph storage server. In the next step, the user starts the SANA sentiment analytics application. Upon clicking the start application button, a window pops up, in which the user selects the application jar file provided by SANA. Once the file is imported, the SANA realtime application starts, and all tasks up to visualization are performed automatically from this point on. SANA’s data collector establishes a connection with the Twitter data center using an authentication API and starts fetching data (the user can view the data collection step on the screen); it then ingests the raw data into SANA’s topology, which is essentially the processing logic. Figure 1A shows the topology.

The topology contains three components, Tweet filter, Tweet classifier, and Tweet NER, which perform three tasks: data filtering, sentiment classification, and context extraction. In the next step, three tasks are carried out in parallel. First, the consumer sentiments are visualized in a pie chart which shows the percentages of positive and negative views on a concept, product, or service, which in our demonstration is a land. Figure 1B presents the results, which are produced at intervals of less than a second. Users will observe that the sentiment analysis results are updated constantly, as classification is carried out in realtime over the incoming tweets. Second, SANA’s data distributor stores the results in the data lake (Hadoop cluster) and the graph storage server; the results are also stored on the local disk. Third, the knowledge graph, consisting of the extracted sentiments and their contexts, is visualized by our graph storage. Figure 1C shows the knowledge graph.
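The constantly updated pie chart amounts to maintaining running counts per sentiment class and recomputing the percentages after every classified tweet. A minimal sketch of that arithmetic follows; the class `SentimentTally` and the label names are hypothetical, and the actual rendering of the chart is omitted.

```python
from collections import Counter
from typing import Dict


class SentimentTally:
    """Running counts of classified tweets, from which pie-chart shares are derived."""

    LABELS = ("positive", "negative")

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def update(self, label: str) -> Dict[str, float]:
        """Record one classified tweet and return the refreshed percentages."""
        self.counts[label] += 1
        return self.percentages()

    def percentages(self) -> Dict[str, float]:
        total = sum(self.counts[label] for label in self.LABELS)
        if total == 0:
            return {label: 0.0 for label in self.LABELS}
        return {label: 100.0 * self.counts[label] / total for label in self.LABELS}


tally = SentimentTally()
for label in ["positive", "negative", "positive", "positive"]:
    shares = tally.update(label)  # the chart would be redrawn with these values
```

Because only two counters are kept, each update takes constant time regardless of how many tweets have already been processed.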

Fig. 1.

(A) The SANA topology (B) The percentages of positive and negative sentiment (C) The knowledge graph (D) The textual results of the demonstration query

Finally, a user might be interested in running correlated queries to extract more knowledge from the tweets. When the user clicks the Analysis button, a textbox appears on the screen. The user then types a query such as “match (n) - -> 2 with n, count (*) as rel_cnt where rel_cnt > 2 return n.Id n.text Limit 15”. In our demo, this query returned 15 tweets. Each of these tweets contains more than two relations among the nodes that represent context and sentiment. Figure 1D shows the textual representation of the query results. A video of this demonstration, which gives more detail, is available at http://cognitus-research.webs.com.
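The demonstration query reads like a Cypher-style graph query that keeps nodes with more than two relations. Its relationship pattern is not fully legible as printed, so the sketch below rests on an assumed reading, stored in `ASSUMED_QUERY`, in which the outgoing relations of each node are counted. The toy graph, the node names, and the function `nodes_with_more_than` are hypothetical; the function emulates the assumed semantics in plain Python rather than calling a graph database.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Assumed reading of the demonstration query (not confirmed by the source).
ASSUMED_QUERY = (
    "MATCH (n)-->() WITH n, count(*) AS rel_cnt "
    "WHERE rel_cnt > 2 RETURN n.Id, n.text LIMIT 15"
)

# Hypothetical toy knowledge graph: tweet nodes point to sentiment and context nodes.
nodes: Dict[str, Dict[str, object]] = {
    "t1": {"Id": 1, "text": "Great keynote by Alice in Berlin"},
    "t2": {"Id": 2, "text": "Bad coffee"},
}
edges: List[Tuple[str, str]] = [
    ("t1", "positive"), ("t1", "Alice"), ("t1", "Berlin"),
    ("t2", "negative"),
]


def nodes_with_more_than(
    edges: List[Tuple[str, str]],
    nodes: Dict[str, Dict[str, object]],
    min_rels: int = 2,
    limit: int = 15,
) -> List[Tuple[object, object]]:
    """Return (Id, text) for nodes with more than min_rels outgoing relations."""
    out_degree = Counter(src for src, _ in edges)
    rows = [
        (nodes[name]["Id"], nodes[name]["text"])
        for name, rel_cnt in out_degree.items()
        if rel_cnt > min_rels and name in nodes
    ]
    return rows[:limit]


result = nodes_with_more_than(edges, nodes)
```

In this toy graph only the first tweet qualifies, since it is linked to one sentiment node and two context nodes, whereas the second tweet has a single relation.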