
1 Introduction

The term ‘Big Data’ describes large and complex datasets that are beyond the ability of manual techniques to collect, manage, process, and analyze within a reasonable time frame [4]. There is no single agreed definition of big data; it is commonly characterized by its three properties, the ‘3Vs’ [23].

  • Volume: the size of the data such as terabytes (TB), petabytes (PB), etc.

  • Variety: the type of the data. For example, various sources, e.g., sensors, software logs, etc., produce the data in different formats.

  • Velocity: the rate of the data generation per second, minute, hour, etc.

As a part of big data, social information services, e.g., Facebook, Twitter, etc., hold a unique position. The big data generated by these services, i.e., big social data, contains freely available information such as the public opinion of social media users, i.e., social sensors [17]. Moreover, social information services have become a driving force in large-scale online data generation. The ‘3Vs’ of big data apply to social information services as follows:

  • Social Information Service Volume: Social information services have a growing number of social sensors. According to statista.com, by the end of 2018, there will be 2.44 billion social sensors sharing their data online.

  • Social Information Service Variety: There are six types of social information services [7]. The data shared on these services is unstructured and available in various formats (e.g., audio, video, text).

  • Social Information Service Velocity: These services generate data at an impressive speed [3]. For instance, 500 million tweets and 55 million photos are posted per day on Twitter and Instagram, respectively. 100 h of video are uploaded per minute on Youtube. On Facebook, 510,000 comments and 293,000 statuses are posted, and 136,000 photos are uploaded, every minute.

In our previous work [3], we defined a classification model to demonstrate the diverse data features and sharing mechanisms of social information services. For example, Twitter, a micro-blogging service, allows social sensors to share data as 140-character messages, i.e., tweets. Meanwhile, Youtube, a video streaming service, allows social sensors to share and watch videos online. Consequently, the data generated by these services has different features in terms of size, text length, noise, etc.

Despite the presence of ‘variety’, current social information service analysis tools and techniques analyze big social data as a single entity with similar features. Moreover, current techniques use traditional data-oriented approaches, e.g., manual dataset labeling, algorithm training, etc., for information analysis [19]. In addition, many tools are dedicated to a specific social information service (e.g., Twitter). As a result, end users may have to use multiple tools in an ad-hoc manner to analyze various social information services. Using multiple tools is time consuming and provides inconsistent views of social sensor data [22].

In this paper, we propose the ‘Big Social Data as a Service’ (BSDaaS) composition framework, which extracts big social data from various social information services and transforms it into useful information. Our framework provides a service quality model to present the diverse features of social information services. Unlike existing data-oriented techniques, our framework processes each social information service separately. It uses the service quality model to compose multiple services (e.g., preprocessing, information extraction) for each social information service for various analysis applications. We use social information service based sentiment analysis as our motivating scenario. However, our framework is not limited to sentiment analysis and can be applied to other applications such as data and text mining. The main contributions of our work are as follows:

  • A service composition framework to process and analyze the big social data. Our framework includes a new formal model to define composite and component services for big social data analysis.

  • A new service quality model that captures the dynamic features of social information services based on the big social data features.

  • A quality model driven service composition method based on graph-planning.

In the rest of the paper, Sect. 2 illustrates the motivating scenario. Section 3 highlights the related work. Sections 4 and 5 describe the solution approach and the dynamic composition approach, respectively. Section 6 details the experiments and evaluation results. Section 7 concludes and outlines future work directions.

2 Motivating Scenario

Let us assume that ‘Bella’ is a social information service analyst working for the ‘Nation First’ political party. The party is gearing up for an upcoming presidential election and is interested in gathering public sentiment about the presidential candidates. In addition, the party wants to visualize the public opinion across multiple constituencies, i.e., geographical locations.

Fig. 1. Traditional sentiment analysis approach

Bella is given the task of developing a sentiment analysis system to analyze public opinion from social information services. Bella develops the system using a traditional data-oriented approach (see Fig. 1). She combines several components as follows: she first selects various social information services as data sources for public opinion collection; in particular, Bella selects only those social information services which provide geo-tagged data. Secondly, she defines a process model to remove the noise, i.e., irrelevant data, from the collected data. Afterwards, she uses a subset of the collected data to train a machine learning algorithm for information extraction (e.g., sentiment classification, entity extraction). Finally, the analysis results are presented in various formats (e.g., charts, maps). The above approach has the following limitations:

  • Many social information services do not provide a geo-tagging facility. Since Bella can only use social information services which provide geo-tagged data, she may miss out on several potential information sources, i.e., social information services.

  • Social information services provide various functionalities (e.g., video streaming, image sharing) and offer different features (e.g., text length) for data sharing. Thus, the quality of the data (e.g., noise level) varies across services. Hence, one noise removal mechanism cannot be applied to all of them.

  • Similarly, one analysis algorithm may not be applicable to all types of social information services. Furthermore, the topics of interest, i.e., data trends, on social information services keep changing. Thus, the information analysis algorithm requires constant training and validation by Bella.

With the above limitations, the sentiment analysis system may not illustrate the real public opinion. Also, it requires constant involvement from the analyst for algorithm training and validation.

In contrast, BSDaaS processes social information services based on their quality features. It enables the composition of appropriate services, e.g., location extraction, sentiment analysis, etc., based on relevant features for information extraction and integration.

3 Related Work

Researchers have explored the usage of social information services in diverse domains such as business and marketing [10], product design [9], market prediction [15], and election campaigns [20]. Social information services provide opportunities to gain insights into human activities based on geographical locations [2]. Spatio-temporal analysis of social information services has been used in flu surveillance [6]. Sentiment analysis is used to extract opinion, attitude, and emotion from unstructured data, i.e., text [12]. Sentiment analysis has been used in applications such as FOREX rate prediction, movie revenue prediction, marketing intelligence, business analytics, recommender systems, and political analysis [14].

The service-oriented paradigm provides the ability to compose services for developing complex applications [11]. There are two types of service composition: functional driven and non-functional driven. In functional driven compositions, the operational properties of services are used to find services [13]. In non-functional driven compositions, the quality of service (QoS) properties (e.g., price) are used to compose services. ‘Big Data as a Service’ (BDaaS) is a relatively new research domain. Research work [21] proposed a high-level big data as a service architecture, which only provides a set of abstract big data processing components. Similarly, [24] provides an overview of three computing models: Big Data Infrastructure-as-a-Service, Big Data Platform-as-a-Service, and Big Data Analytics Software-as-a-Service. However, these works only provide high-level models of big data as a service, and do not specifically focus on social information service analysis.

There are several tools available for social information service analysis. Some tools are dedicated to specific platforms. For instance, SentimentViz analyzes Twitter only. There are also many tools, e.g., Hootsuite, that support the analysis of multiple social information services. These tools are mostly used for customer engagement and content publishing. However, they analyze all social information services at the same level and provide only simple analyses, e.g., social sensor likes, general sentiment, or trending words. Moreover, such tools lack the flexibility to compose or combine the analysis results across social information services.

4 Solution Approach: Big Social Data as a Service

We divide our solution approach into two parts: (1) the system model and (2) the service model. The system model defines the baseline architecture and components. We conceptualize each component as a set of functionally similar services. The data is shared between components via data messages, e.g., XML, CSV. The service model describes the key concepts underlying BSDaaS. It also includes a quality model to capture the dynamic features of social information services. In the following subsections, we provide the details of both models.

Fig. 2. Data pipeline architecture

4.1 System Model

The system model defines a data pipeline architecture (see Fig. 2) for big social data analysis. The data pipeline model consists of five data processing components: data collector, data pre-processor, location extractor, data analyzer, and integrator. The data pipeline enables the simultaneous processing of multiple social information services based on their data features (see Sect. 5). In particular, each social information service is processed by the first four components, which form its data pipeline. Finally, the integrator composes the results of all data pipelines. The details of each data processing component are as follows:

Data Collector. The data collector gathers the data from various social information services. Many social information services do not provide a standardized mechanism, i.e., APIs, for searching and extracting the data. For example, Twitter provides a search API to find relevant data based on keyword(s) (e.g., Trump, US President). In contrast, many social information services do not provide search APIs. Alternatively, Web scrapers are used for data gathering.
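To make the two collection paths concrete, the sketch below models them behind one interface; the `api_client` and `scraper` objects and their methods are hypothetical placeholders, not real service SDKs:

```python
from abc import ABC, abstractmethod

class DataCollector(ABC):
    """Common interface: each social information service gets its own collector."""
    @abstractmethod
    def collect(self, keywords, start, end):
        ...

class SearchAPICollector(DataCollector):
    """For services that expose a search API (e.g., Twitter)."""
    def __init__(self, api_client):
        self.api = api_client  # hypothetical client wrapping the service's API

    def collect(self, keywords, start, end):
        # one search call per keyword, flattened into a single result list
        return [item for kw in keywords for item in self.api.search(kw, start, end)]

class ScraperCollector(DataCollector):
    """Fallback for services without a search API: use a Web scraper."""
    def __init__(self, scraper):
        self.scraper = scraper  # hypothetical scraper object

    def collect(self, keywords, start, end):
        return self.scraper.scrape(keywords, since=start, until=end)
```

Keeping both behind `DataCollector` lets the rest of the pipeline stay agnostic about how a given service's data was obtained.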

Data Pre-processor. The data pre-processor removes the noise from the collected data. We use the following four filters for the data pre-processing:

  • Language Filter: Social sensors share data in different languages (e.g., English). The language filter retains only the data in a given language.

  • Search Terms Filter: Search APIs (e.g., the Twitter Streaming API) provide randomly sampled data based on matching keywords. However, the collected data may still contain irrelevant items. Thus, the search terms are used to cross-validate the collected data.

  • Extra Data Filter: The collected data often contains unnecessary information (e.g., embedded URLs). For example, the tweet “Angry Ivanka Trump Walks Out Of Cosmo Interview youtu.be/nKxxlftcYyY via @YouTube” contains an embedded URL. The extra data filter removes the unnecessary data.

  • Stop Words Filter: Stop words or phrases are used to exclude the data that is irrelevant to the context. For instance, the above tweet contains the keyword ‘Trump’. However, it is not relevant to the context of ‘President Trump’. Thus, ‘Ivanka’ can be used as a stop word to exclude such data items.
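A minimal sketch of the four filters over a list of item dictionaries; the `text`/`lang` field names are hypothetical, and a real deployment would use proper language identification rather than a pre-set `lang` field:

```python
import re

def language_filter(items, lang="en"):
    """Keep only items in the target language."""
    return [d for d in items if d.get("lang") == lang]

def search_terms_filter(items, terms):
    """Cross-validate: keep items that actually mention a search term."""
    return [d for d in items if any(t.lower() in d["text"].lower() for t in terms)]

def extra_data_filter(items):
    """Strip embedded URLs (full and shortened forms) from the text."""
    url = re.compile(r"https?://\S+|\S+\.\S+/\S+")
    for d in items:
        d["text"] = url.sub("", d["text"]).strip()
    return items

def stop_words_filter(items, stop_words):
    """Drop items containing context-irrelevant keywords (e.g., 'Ivanka')."""
    return [d for d in items
            if not any(s.lower() in d["text"].lower() for s in stop_words)]
```

Applied in sequence, the filters reproduce the pre-processing order described above.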

Location Extractor. Many social information services do not provide a geo-tagging facility. Moreover, some social sensors may intentionally not expose their geographical information. Therefore, the location information is extracted by checking for mentions of location names (e.g., cities). The location extractor first parses the data, i.e., the text, to identify locations. Secondly, it uses a geo-coding technique to assign geo-coordinates, i.e., longitude and latitude, to the extracted locations. Data without location information is eliminated.
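The two-step extraction (parse location names, then geo-code them) can be sketched with a toy gazetteer; the two entries and field names are illustrative only, and a production system would combine an NER tool with a geocoding service:

```python
# Toy gazetteer mapping location names to (latitude, longitude).
GAZETTEER = {
    "new york": (40.71, -74.01),
    "chicago": (41.88, -87.63),
}

def extract_location(item):
    """Geo-tag the item with the first location mentioned in its text,
    or return None when no known location is found."""
    text = item["text"].lower()
    for name, (lat, lon) in GAZETTEER.items():
        if name in text:
            item["geo"] = {"lat": lat, "lon": lon, "name": name}
            return item
    return None

def location_extractor(items):
    """Items without location information are eliminated."""
    return [g for g in (extract_location(d) for d in items) if g is not None]
```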

Data Analyzer. The data analyzer consists of multiple data analysis algorithms, i.e., services, to extract the required information (e.g., sentiment, emotion). There are various commercial and non-commercial services available online which specialize in analyzing data for various domains (e.g., politics). The benefit of using online services is that they are maintained and updated by their providers. Furthermore, these services also provide several non-functional features (e.g., price) to fulfill QoS preferences.

Integrator. The integrator composes the results from multiple data pipelines based on the spatio-temporal requirements. Let us assume that M is a matrix of j data pipelines. A data pipeline \(P_{j}\) comprises four data processing components: data collection \(C_{j}\), data preprocessing \(R_{j}\), location extraction \(L_{j}\), and data analysis \(A_{j}\). Each component has k candidate services. In matrix M, each tuple represents a data pipeline. A data pipeline \(P_{j}\) may be composed of multiple candidate services from each data processing component. Equation 2 presents the composition of data processing components for the data pipeline \(P_{j}\) of social information service SN.

$$\begin{aligned} P_{j}(SN_{j}) = {\mathop {\sum }\nolimits _{i=1}^{k}}{\left[ C_{ci} + R_{ri} + L_{li} + A_{ai}\right] } \end{aligned}$$
(2)

It is possible that a data pipeline may provide results in various formats or data types. The integrator component incorporates the analysis results of all data pipelines into a cohesive format based on spatio-temporal features. Equation 3 shows the composition function Com(S, T) for pipeline integration, where S and T present the spatio-temporal parameters.

$$\begin{aligned} \begin{array}{ll} Com(S, T) = {\mathop {\sum }\nolimits _{j=0}^{n}}{P_{j}(SN_{j})} \end{array} \end{aligned}$$
(3)
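The two composition levels above can be read as plain function composition: Eq. 2 chains one candidate service per component into a pipeline, and Eq. 3 sums the pipeline results within spatio-temporal bounds. A sketch, with hypothetical stage functions and record fields:

```python
def run_pipeline(sn, collect, preprocess, locate, analyze):
    """Eq. 2: one data pipeline P_j for service SN, chaining one candidate
    service per data processing component."""
    return analyze(locate(preprocess(collect(sn))))

def integrate(pipelines, space, time_range):
    """Eq. 3, Com(S, T): combine the results of all pipelines, keeping only
    items inside the spatial region and temporal bounds."""
    results = []
    for sn, stages in pipelines:
        for item in run_pipeline(sn, *stages):
            if item["region"] == space and time_range[0] <= item["t"] <= time_range[1]:
                results.append(item)
    return results
```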

4.2 Service Model

The service model first defines a formal model to represent the component and composite services. Secondly, a service quality model is presented that captures the big social data features of social information services.

Component and Composite Service Model. We use a top-down approach to define the service model. First, we define the BSDaaS as a composite service. Later, we determine each data processing component as a service.

  • Definition 1. The Big Social Data as a Service (BSDaaS) is a composite service. It is a tuple \({<}ID, AP, AF_{i}, Q_{i}{>}\), where

    • ID is the service id.

    • AP is the type of analysis application (e.g., sentiment analysis).

    • \(AF_{i}\) is a set of functions (e.g., sentiment polarity) provided by the application.

    • \(Q_{i}\) is a set \({<}q_{1},q_{2},\ldots ,q_{n}{>}\), where \(q_{i}\) denotes a QoS feature (e.g., price).

  • Definition 2. The data collection service DCS collects the data from social information services. DCS is a tuple \({<}cid, k, ts - te, SN, q_{i}{>}\), where

    • cid is the service id.

    • k presents a set of keywords to search for relevant data.

    • ts-te presents the temporal bounds, i.e., the start (ts) and end (te) time for data collection.

    • SN is the social information service for the data collection.

    • \(q_{i}\) presents the set of QoS properties (e.g., response time).

  • Definition 3. The data pre-processing service PRS applies filters to remove different types of noise. PRS is a tuple \({<}pid, fln, dp, q_{i}{>}\), where

    • pid is the service id.

    • fln describes a set of appropriate noise removal filters.

    • dp presents the data features used for noise removal filter selection.

    • \(q_{i}\) defines the QoS properties (e.g., accuracy).

  • Definition 4. The location extractor service LES extracts geographical locations and geo-tags the data. LES is a tuple \({<}lid, flc, q_{i}{>}\), where

    • lid is the service id.

    • flc extracts locations from the text and geo-tags the data by using a gazetteer.

    • \(q_{i}\) is a set of QoS properties (e.g., throughput).

  • Definition 5. The data analyzer service DAS extracts various types of information from the data. DAS is a tuple \({<}aid, inf, dap, ln, q_{i}{>}\), where

    • aid is the service id.

    • inf defines the subjective information (e.g., sentiment, entities) required from the data.

    • dap defines the data features required for analysis service selection.

    • ln presents the ability to perform the analysis on multiple languages.

    • \(q_{i}\) determines the set of QoS properties (e.g., precision).

  • Definition 6. The data integrator service DIS composes the analysis results from multiple data pipelines. DIS is a tuple \({<}npl, space - time, df{>}\), where

    • npl defines the set of data pipelines aggregated for the final composition.

    • \(space - time\) clusters the large data into small segments based on location (e.g., states, cities) and time.

    • df presents the final results in various formats (e.g., maps, tables, charts).
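The tuple definitions above map naturally onto typed records. A sketch of Definitions 1 and 2 (the remaining definitions follow the same pattern; field names are direct transliterations of the tuple elements):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class BSDaaS:
    """Definition 1: composite service <ID, AP, AF_i, Q_i>."""
    id: str
    ap: str                          # analysis application, e.g. "sentiment analysis"
    af: List[str]                    # provided functions, e.g. ["sentiment polarity"]
    qos: Dict[str, float] = field(default_factory=dict)  # Q_i, e.g. {"price": ...}

@dataclass
class DataCollectionService:
    """Definition 2: DCS = <cid, k, ts-te, SN, q_i>."""
    cid: str
    keywords: List[str]              # k, search keywords
    ts_te: Tuple[str, str]           # temporal bounds (start, end)
    sn: str                          # target social information service
    qos: Dict[str, float] = field(default_factory=dict)
```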

Social Information Service Quality Model. We define an extensible quality model to capture various quality features, drawing on existing domains, i.e., sensor cloud computing and data quality assessment.

Data Volume. The data volume \(V_{N}\) determines the quantity of the collected data (e.g., tweets, comments) [8]. The data volume is useful to predict the resources (e.g., time, space) required for data processing. The data volume for a social information service SN is calculated by the following function:

$$\begin{aligned} f_{Volume}(SN)= V_{N} \end{aligned}$$
(4)

Data Richness. Social information services have rich data due to the contributions of multiple social sensors [1]. A high number of social sensors increases the confidence in the final analysis. The richness \(R_{Sen}\) defines the number of unique social sensors. It is used to select or reject a dataset based on a satisfactory level of social sensor participation. The data richness is calculated by the following function:

$$\begin{aligned} f_{Richness}(SN)= R_{Sen} \end{aligned}$$
(5)

Data Freshness. The freshness implies that the data is recent and does not contain outdated items [8]. The freshness \(F_{R}\) determines the recency of social sensor data with respect to temporal properties. The freshness is calculated as below:

$$\begin{aligned} f_{Freshness}(SN)= F_{R} \mapsto \int _{ts}^{te} \varDelta \end{aligned}$$
(6)

where ts is the time, i.e., time stamp, of the oldest data item, and te is the time of the latest data item in the dataset \(\varDelta \).

Data Mode. Social sensors share data in two modes: direct and indirect. In direct mode DI, social sensors share data by mentioning the topic. For example, the tweet “Nice... Trump offers Door Prize to entering illegals....” contains a direct mention of the topic ‘Trump’. In indirect mode IN, social sensors respond to a content item (e.g., a video) and may not mention the topic. Thus, the two data modes require different noise filters. The data mode is defined as follows:

$$\begin{aligned} f_{Mode}(SN)= \left\{ \begin{array}{ll} if(C_{SN}==M), D_{Mode} \mapsto IN\\ Otherwise, D_{Mode} \mapsto DI \end{array} \right. \end{aligned}$$
(7)

where \(C_{SN}\) is the data content collected from social information service, M is the multimedia data content (e.g., images, videos).

Geo-Spatial Data. Spatial information is used to monitor the data by geo-locations [5]. The geo-spatial data \(G_{Data}\) is decided based on the ability of social information service to provide the geo-tagging facility. The geo-spatial data property is determined as follows:

$$\begin{aligned} f_{Geo}(SN)= \left\{ \begin{array}{ll} if(SN==G_{Data}), G_{Data} \mapsto True\\ Otherwise, G_{Data} \mapsto False \end{array} \right. \end{aligned}$$
(8)

Text Type. There are two types of online text writing styles: formal and informal [18]. In the informal style \(T_{Inf}\), the text is short and contains Internet language (e.g., abbreviations, slang). In the formal style \(T_{Fo}\), the text is written using proper language skills. Information extraction from the two types requires different text analysis techniques. The text type \(T_{Type}\) of a data item \(d_{i}\) is classified as follows:

$$\begin{aligned} f_{TextType}(SN)=\forall d_{i}\in (SN):\{T_{Type}\mapsto T_{Fo} | T_{Inf}\} \end{aligned}$$
(9)
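As an illustration, the six quality functions (Eqs. 4–9) can be sketched over a list of collected items; the `sensor_id`/`timestamp` fields and the tiny slang set are hypothetical stand-ins:

```python
def f_volume(dataset):
    """Eq. 4: the number of collected data items."""
    return len(dataset)

def f_richness(dataset):
    """Eq. 5: the number of unique social sensors in the dataset."""
    return len({d["sensor_id"] for d in dataset})

def f_freshness(dataset):
    """Eq. 6: the temporal bounds (ts, te) of the dataset."""
    stamps = [d["timestamp"] for d in dataset]
    return min(stamps), max(stamps)

def f_mode(content_is_multimedia):
    """Eq. 7: indirect mode when sensors respond to multimedia content,
    direct mode when they mention the topic in text."""
    return "IN" if content_is_multimedia else "DI"

def f_geo(service_provides_geotags):
    """Eq. 8: True when the service exposes geo-tagged data."""
    return bool(service_provides_geotags)

SLANG = frozenset({"lol", "omg", "u", "gr8"})  # tiny stand-in slang set

def f_text_type(text):
    """Eq. 9: crude formal/informal split based on Internet slang tokens."""
    tokens = text.lower().split()
    return "informal" if any(t in SLANG for t in tokens) else "formal"
```

These functions are the quality attributes that later serve as constraints in the composition approach of Sect. 5.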

5 Quality Driven Service Composition Approach

We define a social information service quality driven service composition approach based on GraphPlan. Graph-planning is a constraint-based technique which uses states, actions, pre-, post-, and negative constraints for task planning [16]. Our approach uses the service quality attributes as constraints for task planning. The proposed approach is described in the following three subsections.

5.1 Composition Model

The core concepts of GraphPlan composition model are explained as follows:

  • Goal: The goal \(G_{BSDaaS}\) of the planning problem consists of four sub-goals: \(G_{BSDaaS}\)={\(G_{DCS}, G_{PRS}, G_{LES}, G_{DAS}\)}, where \(G_{DCS}\), \(G_{PRS}\), \(G_{LES}\) and \(G_{DAS}\) present the sub-goals of data collection, data preprocessing, location extraction, and data analysis, respectively, in a data pipeline.

  • The Planning Problem: The planning problem for the composition process is denoted as P. It comprises three elements: the state transition \(\varSigma \), the initial state \(I_{S}\), and the goal \(G_{BSDaaS}\), defined as \(P=\{\varSigma , I_{S}, G_{BSDaaS} \}\).

  • State Transition: \(\varSigma \) consists of three sub-elements: set of states S, set of actions A and set of constraints \(C_{s}\), defined as \(\varSigma =\{S, A, C_{s} \}\).

  • State: \(S_{i}\) consists of n number of tasks \(T_{n}\), denoted as \(S_{i}=\{T_{1},T_{2},..,T_{n}\}\). For example, data preprocessing consists of several noise removal tasks. For each task, there are m candidate services \(T_{i}=\{ws_{1},ws_{2},\ldots ,ws_{m}\}\) available.

  • Action: An action in A creates a respective task in the state transition based on constraints.

  • Constraint: \(C_{s}\) is a set of conditions attached to the set of actions that must be true or false before an action creates a task.

  • Task Simulation: In GraphPlan, task simulation \(F_{n}(T)\) simulates tasks based on conjunctive actions and constraints as \(F_{n}(Ti)=\{A_{i} \mapsto C_{si} \}\). For example, an action DataCollection and respective constraint Twitter create the task of TwitterServiceSelection in the planning process.

  • Trivial Solution: Based on the above concepts, each sub-goal is achieved by decomposing it into states and respective tasks. A solution to P exists if and only if all goal states intersect with the set of tasks reachable from the initial task \(T_{I}\), and the final result set is not empty: \(G_{BSDaaS}\cap T_{n}^{>} (\{T_{I}\})\ne \{\}\).

Algorithm 1. Graph-planning algorithm for data pipeline composition

5.2 Graph-Planning Algorithm

The proposed algorithm extends the GraphPlan technique [16]. In our method, we limit the algorithm to the Forward-Search variant of GraphPlan to find a solution for data pipeline composition. Algorithm 1 comprises two basic steps. First, it provides the graph planner an initial graph \(I_{S}\) with a set of tasks and an initial set of constraints, and then expands the solution graph, based on the quality model constraints for each sub-goal, into a graph which may include a solution. Secondly, if a solution is available, the algorithm extracts it from the graph.

Fig. 3. Facebook data pipeline composition graph

Line 1 provides the graph planner four inputs: (1) the index i for the graph layer; (2) MT, a master constraint table which includes the set of actions and respective constraints; (3) an initial constraint \(C_{s0}\); and (4) an initial graph layer G for the corresponding tasks. In lines 2 to 7, with the initial input, the algorithm continues to expand the graph until all of the constraints are met for the current layer, i.e., sub-goal, or no solution exists. After a solution is found for the current layer, the algorithm repeats the process for the next layer based on the respective action and constraints from the master table. The graph is expanded by adding the solution for each layer. Lines 8 to 10 set the termination condition by checking the solution graph size after each layer is processed; if the size does not change, the algorithm returns failure and aborts the whole process. Finally, line 11 returns the composition graph plan.
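The loop described above can be sketched as follows, assuming the master table holds one list of (action, constraint, task) entries per layer; all names and the state dictionary are hypothetical:

```python
def forward_search(initial_tasks, master_table, state):
    """Forward-search sketch: expand the plan layer by layer (one layer per
    sub-goal). An action creates its task only when its constraint holds on
    the current state. If a layer adds no task, the graph has levelled off
    and no solution exists."""
    plan = list(initial_tasks)
    for layer in master_table:
        size_before = len(plan)
        for action, constraint, task in layer:
            if constraint(state):       # constraint met: action creates the task
                plan.append(task)
        if len(plan) == size_before:    # termination check: graph did not grow
            return None                 # failure: abort the whole process
    return plan                         # the composition graph plan
```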

5.3 Service Composition Plan Generation

We use graph-planning based data pipeline composition as an example scenario (see Fig. 3). The graph planner starts the process by initiating the graph with four inputs: the index i, the constraint table MT, the initial constraint \(C_{s0}\), and the graph G. \(C_{s0}\) defines the initial constraints (e.g., volume, richness). MT contains four sets of constraints for the four sub-goals, based on the service quality model: data, preprocessing, location, and analysis. For sub-goal \(G_{DCS}\), the data collection task is initiated against the respective social information service in the data pipeline, i.e., Facebook. For the current state transition, three possible data constraints are validated based on \(C_{s0}\): data volume \(V_{N}\), data richness \(R_{Sen}\), and data freshness \(F_{R}\), before moving to the next sub-goal. For the data preprocessing \(G_{PRS}\) and location extraction \(G_{LES}\) sub-goals, preprocessing and location constraints are validated by using \(f_{Mode}\) and \(G_{Data}\). In MT, for data mode DI all noise filters are selected; for IN, the search term filter is not included. In addition, social information services are marked True or False for their geo-tagging facility. As Facebook does not provide geo-tagged data, a location extraction task is created for \(G_{LES}\) and the state transitions into \(G_{DAS}\). Finally, the analysis constraint, i.e., text type \(T_{Type}\), is used to create the data analysis service selection task for \(G_{DAS}\).

Table 1. Summary of dataset

6 Experiments and Evaluation

We implemented a prototype for our motivating scenario. We collected the approval ratings of the American president ‘Donald Trump’ through sentiment analysis. The data was collected from 19 March 2017 to 26 March 2017 from Facebook, Twitter, and Youtube (see Table 1). We evaluate our framework over three sets of experiments: (1) we evaluate the performance of the data collection, preprocessing, and location extraction components; (2) we evaluate the effectiveness of the data analysis component by comparing three sentiment analysis services against human-annotated data; (3) we provide the details of a cost analysis for data pipeline integration.

Fig. 4. (a) Social sensors to data volume ratio (b) Data timeline

Table 2. Evaluation of data filtering results

6.1 Evaluation of System Components

First, we use data volume, data richness and data freshness, to evaluate the data collection component. Secondly, we evaluate data preprocessor and location extractor by using data mode and geo-spatial data properties.

The data volume \(V_{N}\) is measured by the number of data items collected from each social information service. For data richness \(R_{Sen}\), unique social sensors are identified based on platform-generated IDs. Figure 4a provides the ratio of unique social sensors with respect to data volume. For data freshness \(F_{R}\), the collected data is sorted by time and date. Initially, the data collection dates are used as temporal bounds, and data that does not comply with the temporal requirements is discarded. Figure 4b presents the data timeline over the 7-day interval.

For data preprocessing, the data mode \(D_{Mode}\) is determined. The data collected from Facebook and Youtube is generated in response to videos and images. Thus, the data mode for both social information services is classified as indirect IN mode. In contrast, the data collected from Twitter is classified as direct DI mode. Therefore, we used different sets of search terms and stop words for each social information service for data filtering. For location extraction, Facebook and Youtube do not provide geo-tagged data for public access, thus \(G_{Data}\) is determined as False. To extract geo-locations from Facebook and Youtube, we used the Stanford NER (Named Entity Recognizer). In contrast, Twitter provides geo-tagged data for public access; hence, \(G_{Data}\) for Twitter is determined as True. For Twitter, we filtered out all the non-geo-tagged data. We used the Signal-to-Noise Ratio (SNR) as an evaluation metric. For the language filter, we discarded all non-English data. Table 2 shows the filtering results.

Table 3. Evaluation of sentiment analysis services
Fig. 5. Average data integration throughput

6.2 Evaluation of Data Analysis Component

We evaluate the text type \(T_{Type}\) before data analysis by using a binary classifier. Next, we analyze the data with three sentiment analysis services.

The binary classifier labels the data items into two text categories: Formal and Informal. The binary classifier is trained with a subset of Internet slang (e.g., lol, omg) taken from an Internet slang dictionary. The classifier provides the following Formal-to-Informal ratios for Facebook, Twitter, and Youtube: 1:0.07, 0.29:1, and 1:0.08, respectively. Twitter has the highest ratio of informal data, while Facebook and Youtube have the lowest ratios of informal data.
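A minimal sketch of such a slang-lookup classifier, reporting the ratio in the same larger-side-normalized form as above; the slang set is a tiny stand-in for a full dictionary:

```python
SLANG = {"lol", "omg", "u", "gr8", "idk"}  # tiny stand-in for a slang dictionary

def formal_informal_ratio(texts):
    """Label each text Formal/Informal by slang lookup and return the
    Formal-to-Informal ratio, normalized so the larger side is 1."""
    informal = sum(any(t in SLANG for t in x.lower().split()) for x in texts)
    formal = len(texts) - informal
    larger = max(formal, informal, 1)   # guard against an empty input
    return round(formal / larger, 2), round(informal / larger, 2)
```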

For data analysis, we used three sentiment analysis services: Alchemy-API, Microsoft Text Analytics API, and Senti-Strength. The former two APIs are used to analyze formal text, while the latter is designed to analyze informal text. For evaluation, a subset of each social information service's data was manually annotated by human users into three categories: Positive, Negative, and Neutral. Based on the text classification, Facebook and Youtube data is analyzed with Alchemy-API and Microsoft Text Analytics API, respectively, and Twitter data is analyzed with Senti-Strength. We used three evaluation metrics: accuracy AC, precision PP, and recall PR. Table 3 provides the evaluation details of the sentiment analysis. The evaluation metrics are defined below:

  • Accuracy AC = \(\frac{AR}{TR}\), where TR is the number of all data items and AR is the number of correctly classified items.

  • Recall PR = \(\frac{PC}{TP}\), which measures the coverage of one class of reviews, e.g., positive. TP is the number of all positive reviews and PC is the number of correctly classified positive reviews.

  • Precision PP = \(\frac{PC}{PC+PW}\), which measures the correctness of one class of classifications, e.g., positive. PC is the number of correctly classified positive reviews and PW is the number of reviews wrongly classified as positive.
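The three metrics follow directly from the definitions above; a sketch with plain string labels:

```python
def accuracy(true, pred):
    """AC = AR / TR: fraction of all items classified correctly."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)

def recall(true, pred, cls="positive"):
    """PR = PC / TP: correctly classified `cls` items over all true `cls` items."""
    pc = sum(1 for t, p in zip(true, pred) if t == cls and p == cls)
    return pc / sum(1 for t in true if t == cls)

def precision(true, pred, cls="positive"):
    """PP = PC / (PC + PW): correctly classified `cls` items over all items
    the classifier labelled `cls`."""
    pc = sum(1 for t, p in zip(true, pred) if t == cls and p == cls)
    return pc / sum(1 for p in pred if p == cls)
```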

6.3 Evaluation of Data Integration Component

The analysis services in each data pipeline provide results in different formats. For instance, Senti-Strength rates sentiment polarity between \(\pm 1\) and \(\pm 5\), whereas the Microsoft Text Analytics API grades sentiment polarity from 0% to 100%. Thus, the analysis results are normalized before final integration as follows:

$$\begin{aligned} z_{i} =\frac{x_{i}-min(x)}{max(x)-min(x)} \end{aligned}$$
(10)

where \(x=(x_{1},\ldots ,x_{n})\) are the polarity scores and \(z_{i}\) is the ith normalized value. Throughput is used as an evaluation metric to measure the total data items processed by the data integrator. Figure 5 shows the average integration time.
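Eq. 10 is standard min-max normalization; a minimal sketch showing that heterogeneous score ranges land on the same [0, 1] scale when normalized per service:

```python
def min_max_normalize(scores):
    """Eq. 10: rescale a list of polarity scores to the [0, 1] range."""
    lo, hi = min(scores), max(scores)
    return [(x - lo) / (hi - lo) for x in scores]
```

Applied per pipeline, Senti-Strength-style scores and percentage scores become directly comparable before integration.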

Fig. 6. (a) Sentiment by location (b) Composite sentiment analysis results

For the data visualization, we used ‘USA’ as the spatial parameter. Figure 6a shows the sentiment on Google Maps based on social sensors' geo-locations. Figure 6b shows the overall sentiment analysis results composed in a bar chart. Social sensors on Twitter have the highest percentage of negative and the lowest percentage of positive sentiment for the US president. In comparison, social sensors on Facebook and Youtube have almost equal percentages of negative and positive sentiment.

7 Conclusion and Future Work

We defined and implemented a service composition framework that extracts big social data from social information services and transforms it into meaningful information. We proposed a novel service quality model that captures the heterogeneous features of social information services. Our framework includes a data pipeline infrastructure that uses the service quality features to compose multiple services for various social information services. We also devised a quality model driven service composition algorithm based on graph-planning. In contrast to traditional data-oriented approaches for social information service analysis, we used the notion of service orientation to process and analyze social information services. Service orientation provides the flexibility to compose services based on the quality features of social information services. We implemented a prototype of our framework and conducted experiments on a real dataset. The results show the efficiency of our framework for high-volume data. In future work, we plan to extend the service quality model with social sensor features (e.g., age, race). We also aim to compare our approach with existing techniques.