1 Introduction

Social media platforms such as Facebook, Twitter, YouTube, etc., have emerged as free sources of publicly available information [6]. A large number of social media users (i.e., human sensors, social sensors) share their opinions on various topics, e.g., politics, sports, movies, etc. Initially, social media platforms were recognized as social networking sites where social sensors interact with their peers. However, the growth of social sensors and the amount of data have encouraged researchers and practitioners to utilize social media in various domains such as business, product design, disaster management, health surveillance, politics, etc. Thus, social media platforms are often characterized as data sources for various applications [22].

Social media platforms fall into six categories: collaborative projects, blogs and micro-blogs, social networking sites, content communities, virtual game worlds, and virtual social worlds [9]. Although these platforms are primarily used for online data sharing, they differ in terms of social sensor participation and sharing mechanisms [4]. A typical example is the difference between Twitter and YouTube. Twitter is a micro-blogging site which allows its users to share their opinions in short instant messages of up to 140 characters, i.e., Tweets. YouTube, in contrast, is a video sharing site which allows its users to share videos on diverse topics; viewers may respond to a video with their opinions, i.e., comments. Moreover, different topics such as politics, entertainment, health, etc., may also affect the data features (e.g., size, frequency, noise). Regardless of such differences, current social media analysis tools treat social media platforms as a uniform data source with similar properties [25]. In addition, current tools lack the flexibility to differentiate between social media platforms based on their diverse features. Thus, we argue that in order to efficiently analyze social media data and subsequently draw correct conclusions, it is essential to understand the diversity of social media platforms and their behaviors.

This paper presents a novel approach for modeling and analyzing social media platforms as services, i.e., social information services. Generally, services are conceptualized as autonomous software systems which can be accessed over the Internet through programmatic interfaces [27]. In addition, services are visualized as black boxes where the service implementation is hidden from end users [18]. Services have two kinds of properties: functional and non-functional. Functional properties define the operations offered by a service. Non-functional properties determine the QoS (Quality of Service), e.g., price, response time, etc. Traditionally, however, services are utilized by various applications and have no direct interaction with end users. In comparison, social information services have traits that differentiate them from traditional services: (1) Social information services are directly utilized by social sensors. (2) Social information services have enormous numbers of users, and produce large amounts of data on multiple topics at high speed. (3) The data produced by social information services has diverse features, e.g., size, noise, etc. These distinct properties give rise to the following questions:

  • How to model social information services? What are the functional and QoS properties of social information services?

  • How do different domains, i.e., topics, affect the functional and QoS properties of social information services?

In this paper, we explore and answer the above questions. To the best of our knowledge, our work differs from previous efforts by being the first to utilize the notion of service orientation to analyze social media. We aim to abstract the functional and QoS properties of social information services. Moreover, our research is an important first step in understanding the requirements to process and analyze various social information services based on their functional and QoS properties. The contributions of our work are as follows:

  • We propose a formal service model to represent social media platforms as social information services.

  • We propose a classification model for the functional and non-functional properties of social information services.

  • We collect the data from three social information services related to three different domains: Politics, Entertainment, Health. We analyze and quantify results based on multiple functional and non-functional properties.

The rest of the paper is organized as follows. Section 2 illustrates the motivating scenario. Section 3 highlights the related work. Section 4 describes the details of the analysis methodology. Section 5 provides the details of our findings through experimental results. Section 6 provides the detailed discussion and direction for research challenges. Section 7 concludes the paper.

Fig. 1. Service composition for social information services analysis

2 Motivating Scenario

We use social information service based opinion mining and sentiment analysis as our motivating scenario. Let us assume that ‘Bella’ is an analyst working for the Department of Social Services (DSS) and the Department of Health Services (DHS). The DSS needs to analyze public opinion to understand general public views on various political and policy issues. The DHS, on the other hand, is interested in detecting ‘Influenza’ outbreaks and their severity in multiple cities. In this scenario, Bella has to plan and develop a social information service based analysis system for both stakeholders, i.e., Departments. In addition, due to cost and time constraints, neither Department wants to develop its system from scratch. Therefore, Bella needs to use a combination of off-the-shelf and online components, e.g., APIs, services, etc., that suit the requirements.

In this scenario, Bella needs to define a different strategy for each Department. For the DSS, as a first step, she needs to select a set of social information services for data gathering that are suitable for political discussions. Secondly, she needs to select tools for data pre-processing, e.g., noise removal, etc. Finally, she needs to select sentiment analysis services specialized in the domain of politics to extract relevant opinions and sentiments. It is quite possible that a single service for pre-processing and opinion analysis may not be applicable to all social information services, as different social information services have different properties, e.g., data types, data size, etc. Thus, Bella may need to select multiple tools according to each social information service.

Meanwhile, for the DHS, Bella needs to change her initial strategy. First, she selects a set of social information services which are suitable for health information sharing. Bella is also required to choose only those social information services which carry information about social sensors’ geographical locations, i.e., cities. Alternatively, she may select services that can extract the location information from the data. Next, as in her first strategy, she requires appropriate services for data pre-processing. This is followed by the selection of opinion analysis services specialized for health surveillance to extract the information. Finally, the best available services are composed for the end users (see Fig. 1).

In order to compose appropriate services, Bella needs to differentiate between social information services and understand how to process them. In this scenario, Bella will use our proposed QoS properties as a blueprint to select social information services, pre-processing services and analysis services for composition. In a typical social information service based sentiment analysis tool, all the data sources are processed and analyzed as a single homogeneous entity. In comparison, the functional and QoS properties will help to anticipate and understand the requirements to gather, process and analyze data from the selected social information services on an individual basis. Moreover, they will help to define trade-offs for performance and budgetary allocations such as cost and time.

3 Related Work

Social information services have become a popular tool for online communication, data and opinion sharing [2]. Every day, a large number of social sensors generate and share data on various social information services. The data shared on social information services, e.g., comments, tweets, posts, etc., is unstructured. One mechanism to extract useful information from the textual data is sentiment analysis [21]. In its basic form, sentiment analysis classifies text into three categories: positive, negative and neutral [13]. In the broader picture, sentiment analysis or opinion mining is used to extract public opinion, attitudes and emotions towards an entity from text [16]. As a result, social information service based text analysis has emerged as a new research domain. Many researchers have explored the usage of social information services in diverse domains such as business and marketing [14], product design [12], health surveillance [20], stock market prediction [23], and emergency and disaster management [7]. However, current techniques use traditional data oriented approaches, e.g., manual dataset labeling, algorithm training, etc., for information extraction and do not distinguish between social information services.

The service oriented paradigm has emerged as an architectural pattern that uses services as building blocks to build new applications [5]. The service paradigm provides a powerful abstraction that hides implementation specific information and focuses on how to use services [19]. As a result, end users are only concerned with the inputs and outputs of the services. Services provide the capability to develop complex systems by composing existing services [15]. There are two types of service composition: functional driven and non-functional driven. Functional driven service composition uses the operational or functional properties to find and compose services [17]. Non-functional driven service composition focuses on satisfying the QoS requirements of end users.
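
The two composition styles can be illustrated with a minimal sketch. It assumes each candidate service is described by a set of operations and a dictionary of QoS values where lower is better (e.g., price, response time); the service names, operation names, QoS keys and bounds below are all hypothetical and are not taken from any real service registry.

```python
def select_services(candidates, required_operation, max_qos):
    """Return the names of services that offer the operation and meet the QoS bounds."""
    selected = []
    for service in candidates:
        # Functional driven step: the service must offer the required operation.
        if required_operation not in service["operations"]:
            continue
        # Non-functional driven step: every constrained QoS value must stay
        # within its upper bound; a missing value is treated as a violation.
        within_bounds = all(
            service["qos"].get(name, float("inf")) <= bound
            for name, bound in max_qos.items()
        )
        if within_bounds:
            selected.append(service["name"])
    return selected


# Hypothetical candidates; the values are illustrative only.
candidates = [
    {"name": "analyzer-a", "operations": {"sentiment"},
     "qos": {"price": 0.0, "response_time_s": 2.0}},
    {"name": "analyzer-b", "operations": {"sentiment"},
     "qos": {"price": 5.0, "response_time_s": 0.5}},
    {"name": "geocoder-c", "operations": {"extract_location"},
     "qos": {"price": 0.0, "response_time_s": 1.0}},
]

chosen = select_services(candidates, "sentiment",
                         {"price": 1.0, "response_time_s": 3.0})
print(chosen)
```

Here the functional filter removes the geocoder, and the QoS bounds then remove the paid analyzer, leaving a single candidate.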

Many online tools are available for social information service monitoring and analytics. For example, SentimentViz provides analysis for Twitter only. Several tools (e.g., SocialMention, Hootsuite) analyze more than one social information service and support multiple activities. However, these tools are mainly used for multi-platform customer engagement, scheduling and publishing of content. In addition, these tools analyze all social information services at the same level, and provide simple analyses such as user clicks, likes, general sentiment categorization, etc.

In this paper, we propose to utilize the service oriented paradigm to analyze social media platforms as services. Our aim is to explore and evaluate the functional and non-functional properties of social media. Moreover, we intend to provide a blueprint for composing social media platforms as services in various analysis scenarios on the basis of their functional and non-functional properties.

Table 1. Dataset details (by topic)

4 Methodology

To investigate the diversity of social information services, we answer our two research questions using the following methodologies.

4.1 Data Collection

We first collected the data from three social information services: Facebook, Twitter and YouTube. The data is collected through open source tools. For Facebook and Twitter, Facepager is used to collect the data. Facepager uses the ‘Graph API’ to collect Facebook posts and comments from public Facebook pages. For Twitter, Facepager uses the ‘Twitter Streaming API’ to get random tweets for a given set of keywords. For YouTube, online scraping is used to collect the comments from the videos. We used three distinct topics for the data collection: Politics, Entertainment and Health (see Table 1). The data is collected for a period of 7 days from 19-March-2017 to 26-March-2017. The details of each topic are as follows:

  • For politics, the American president ‘Donald Trump’ is selected as the topic of interest. The collected data is based on social sensor conversations from Tweets, YouTube video comments, and Facebook posts from Donald Trump’s official Facebook page.

  • For entertainment, the newly released movie ‘Beauty and the Beast’ is selected as the topic of interest. The data is collected from Twitter, Facebook posts from the official Facebook page of the movie, and YouTube video comments from fan based movie review videos.

  • For health, ‘flu’ is chosen as the topic for the data collection. The data is collected from Twitter, the Center for Disease Control Facebook page and YouTube video comments.

4.2 Solution Modeling

In order to answer our research questions, we adopt the following methodologies and perform experiments on the collected data.

RQ1. How to model social information services? What are the functional and QoS properties of social information services?

We first define a formal service model to present social information services. Secondly, we classify social information services based on their generic functionalities, and generalize the functions for each class of social information services based on their data. In addition, we propose a QoS model that collectively captures the non-functional properties of social information services.

RQ2. How do different domains, i.e., topics, affect the functional and QoS properties of social information services?

We hypothesize that the QoS properties of a social information service may behave differently for diverse topics. Moreover, it is quite possible that social information services may not possess similar QoS properties. To test our hypothesis, we analyze the data from three social information services for three diverse topics: Politics, Entertainment, Health.

5 Findings

In this section, we present our findings according to the proposed methodologies for each research question.

5.1 Service Orientation of Social Media

In this section, we present the answer to our first question (RQ1). First, we define a formal service model. Second, we present a functional classification of social information services and propose a QoS model.

Service Model. A social information service SIS is defined as a tuple of five elements: \({<} ID, P, C, DT, Q {>}\), where

  • ID is the service id.

  • P is the actual social information service, e.g., Twitter.

  • C is the classification or type of the social information service.

  • DT defines the type of data provided by the social information service.

  • Q is a set \({<} q_{1}, q_{2}, \ldots, q_{n} {>}\), where \(q_{i}\) denotes a QoS feature.
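
As an illustration, the five-element tuple can be written down as a small record type. This is only a sketch under our own naming assumptions: the field names, the example service id and the QoS values are hypothetical placeholders, not measured results.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SocialInformationService:
    """One SIS tuple <ID, P, C, DT, Q> from the service model."""
    service_id: str       # ID: the service id
    platform: str         # P: the actual social information service
    classification: str   # C: the class or type of the service
    data_type: str        # DT: the type of data the service provides
    # Q: QoS features q_1..q_n, keyed by feature name
    qos: Dict[str, float] = field(default_factory=dict)


# Hypothetical instance; the id and QoS values are placeholders.
twitter = SocialInformationService(
    service_id="sis-001",
    platform="Twitter",
    classification="BLS",   # micro-blogs belong to the blogging sites class
    data_type="text",
    qos={"relevancy": 0.5, "richness": 0.5},
)
print(twitter)
```

Keeping Q as a name-to-value mapping reflects the statement that the QoS model is extensible: a new feature only adds a key.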

Functional Model. The functional model defines two properties of social information services: Service Classification C and Data Type DT. There are multiple social information services available with similar functionalities. The service classification C is defined based on the generic functions performed by a social information service. The data type DT determines the data (e.g., text) generated by a social information service. Table 2 summarizes the functional classification. We propose the following classification based on generic functionalities:

Content Communities. Content communities CS (e.g., YouTube, Instagram, LiveLeak) are used to share multimedia data such as videos, audio, images, etc. Social sensors upload their data, while other social sensors respond with descriptive data, e.g., comments, likes, dislikes, etc.

Social Networking Sites. Social networking sites SNS (e.g., Facebook, WeChat, LinkedIn) are Web applications which allow social sensors to create community networks. Social sensors create personalized profiles and specific communities to invite friends, colleagues and family members. The data on SNS is shared via personal images and videos, text messages and status postings. Social sensors with access to the community can respond with descriptive data.

Blogging Sites. Blogging sites BLS consist of blogs and micro-blogs. Blogs are special types of Web-sites (e.g., HuffingtonPost, HelpScout) or online forums (e.g., Reddit, Quora) that are used to share textual data such as questions and answers, articles, etc., with the public. Micro-blogging is a newer derivative of blogging. Micro-blogging Web-sites (e.g., Twitter, Tumblr) facilitate the sharing of instant information with friends, family and the public through short text messages. The general public can also respond with descriptive data.

Table 2. Social information service functional classification

Quality of Service Model. In the service oriented paradigm, the QoS describes the non-functional properties of a service. The QoS provides the leverage to select and compose services based on quality features. We determine the QoS of social information services based on their data sharing behaviors. We derive the QoS model from existing research domains, i.e., sensor cloud computing and data quality assessment. The proposed QoS model is extensible. We currently use the following properties:

Relevancy. Relevancy defines the extent to which the data provided by a social information service is applicable to a given topic [10].

Richness. The social information service data is inherently diverse and rich due to the involvement of multiple social sensors [1]. Richness defines the participation of unique social sensors sharing their data.

Freshness. The freshness implies that the data is recent and does not contain any old data [10]. Freshness determines the data posted by social sensors with respect to the time, i.e., data time-lines.

Spatial Information. Spatial information is deemed very useful to visualize the data based on geo-locations [7]. The spatial information property shows the ability to extract the geographical information from the social sensor data.

Text Type. There are two genres of text writing in online communication: formal and informal [24]. In the informal style, the written text is largely influenced by Internet language, which contains abbreviations, slang, emoticons, etc. In the formal style, social sensors use proper language and grammar and avoid Internet language. The text type property classifies text into these two styles.

Lexical Diversity. Social information services often restrict social sensors with text length by limiting the number of characters to be used in the text. The lexical diversity [3] determines the lexical heterogeneity of a piece of text in terms of the number of characters, words and sentences.

Meta-data Properties. Social information services offer social sensors a range of options for responding to content shared by other social sensors. These options produce the associated descriptive data, i.e., meta-data, of the shared content [26]. The meta-data comprises the number of likes, shares, replies, etc.

5.2 Functional and Non-functional Analysis

In this section, we answer our second question (RQ2). We analyze the functional and QoS properties of our dataset with respect to three different topics, and present the analysis results.

Functional Properties Analysis. We analyze and compare the functional properties of each social information service in our dataset with respect to three topics: entertainment, politics and health. Table 3 presents the analysis results. We break our dataset down into seven one-day intervals. For each day, the number of shared data items and their respective responses is recorded. The analysis shows that Facebook and YouTube have a notable disparity in the data for different topics. In comparison, Twitter remains consistent in the flow of data. One reason for Twitter’s consistency is the mechanism provided for collecting tweets, the ‘Streaming API’; the other two social information services do not offer such a standardized mechanism for data collection. We also observe that social sensors share limited data on YouTube on the health topic.

Table 3. Functional analysis of social information services by topic

Non-functional Properties Analysis. For each of the proposed non-functional properties, we applied various techniques for conducting experiments.

Fig. 2. Irrelevant data in three social information services by topic

Relevancy. To remove the irrelevant data from our dataset, the inclusion and exclusion technique [8] is used. We used the following three filters:

  • Language based filter: is used to identify the data, based on a language. We filtered out all non-English data.

  • Search terms based inclusion filter: is used to include the data based on keywords. For example, “Trump”, “US President” are used to include the relevant data items for politics.

  • Stop words/phrases based exclusion filter: is used to exclude the data that is irrelevant. For example, a data item containing “Ivanka Trump” contains the keyword “Trump”, yet it is not relevant to the context. Thus, “Ivanka” is used as a stop word to exclude such data items.
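
The three filters can be sketched as a small pipeline. This is a simplified illustration, not the exact procedure of [8]: in particular, `looks_english` is a crude stand-in that only rejects text written mostly in non-Latin scripts, whereas a real language identifier would be needed to separate English from other Latin-script languages. All function names and the sample items are hypothetical.

```python
import re


def looks_english(text, threshold=0.9):
    """Crude language filter: share of ASCII letters among all letters.

    Assumption: this only catches non-Latin scripts; it is a placeholder
    for a proper language identification step.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    ascii_letters = sum(1 for ch in letters if ch.isascii())
    return ascii_letters / len(letters) >= threshold


def contains_any(text, terms):
    """Case-insensitive whole-word or whole-phrase match against any term."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE)
        for term in terms
    )


def filter_relevant(items, include_terms, exclude_terms, use_inclusion=True):
    """Apply the language, inclusion and exclusion filters in that order."""
    kept = []
    for text in items:
        if not looks_english(text):
            continue                      # language based filter
        if use_inclusion and not contains_any(text, include_terms):
            continue                      # search terms based inclusion filter
        if contains_any(text, exclude_terms):
            continue                      # stop words/phrases exclusion filter
        kept.append(text)
    return kept


# Hypothetical sample items for the politics topic.
sample = [
    "Trump signs a new executive order",
    "Ivanka Trump launches a new line",
    "Трамп подписал новый указ",
    "Nice weather today",
]
relevant = filter_relevant(sample, ["Trump", "US President"], ["Ivanka"])
print(relevant)
```

Setting `use_inclusion=False` skips the second filter, which corresponds to applying only the language and exclusion filters to data that is posted directly in response to a piece of content.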

For Twitter, we used all three filters. We find that Twitter’s data contains a large amount of irrelevant data. One main reason is that the data provided by Twitter is randomly sampled based on keyword matching rather than contextual matching. For instance, the health data contained a large number of tweets with the term “Bieber Fever” from fans of the singer Justin Bieber. Thus, much of the data is discarded by using stop words. Moreover, a large share of Twitter’s data is filtered out because it is written in languages other than English.

For Facebook and YouTube, we used the 1st and 3rd filters. Unlike on Twitter, the data is posted by social sensors directly in response to the content (e.g., a video). For example, the comment “Awesome Movie:)” posted on the ‘Beauty and the Beast’ Facebook page has no direct search term in the text, yet it is still relevant to the context. Also, a larger portion of the Facebook and YouTube data is written in English. Consequently, their relevancy is higher than that of Twitter. Figure 2 shows the percentage of irrelevant data for each service by topic.

Richness. Social information services attract different ranges of social sensors based on their generic functionalities. For each social information service, the participation of unique social sensors may vary with respect to a topic. In our analysis, we found that a large number of social sensors are active on official Facebook pages. In addition, Facebook has a higher number of participants for politics and entertainment than Twitter and YouTube. However, Twitter has more participants for health data than Facebook and YouTube combined. Unexpectedly, YouTube has the fewest participants, in terms of comment sharing, for all topics. Figure 3 presents the number of unique social sensors by topic for the three social information services.
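
Counting unique social sensors per service and topic can be sketched as follows. The record layout of (service, topic, user id) triples and the sample values are assumptions made for illustration; they are not the schema of the collected dataset.

```python
from collections import defaultdict


def unique_sensors(records):
    """Count distinct user ids for every (service, topic) pair.

    records: iterable of (service, topic, user_id) triples (assumed layout).
    """
    seen = defaultdict(set)
    for service, topic, user_id in records:
        seen[(service, topic)].add(user_id)
    return {key: len(users) for key, users in seen.items()}


# Hypothetical records; a user who posts twice is counted once.
records = [
    ("Facebook", "politics", "u1"),
    ("Facebook", "politics", "u1"),
    ("Facebook", "politics", "u2"),
    ("Twitter", "health", "u3"),
]
richness = unique_sensors(records)
print(richness)
```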

Fig. 3. Unique social sensors in social information services by topic

Fig. 4. Data time-line in social information services by topic

Freshness. Social sensors share the data in two modes: stream and non-stream. In stream mode, social sensors actively share the data related to recent events. In non-stream mode, social sensors passively share the data in response to on-going events. We visualize the freshness of our dataset by slicing it along a time-line of 7 days. Unsurprisingly, Twitter shows an almost consistent stream of data for all three topics in each day segment. In contrast, the data on Facebook and YouTube is generated in response to content, and it shows high variation across topics in each day segment. Figure 4 presents the data shared within the 7 day time-line by topic.
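
The time slicing can be sketched as a per-day count over a fixed window. The function name and the sample timestamps are hypothetical; days without data are reported with a count of zero so that gaps in the time-line stay visible.

```python
from collections import Counter
from datetime import date, datetime, timedelta


def daily_timeline(timestamps, start, days=7):
    """Return [(day, item_count)] for `days` consecutive days from `start`."""
    counts = Counter(ts.date() for ts in timestamps)
    window = [start + timedelta(days=offset) for offset in range(days)]
    return [(day, counts.get(day, 0)) for day in window]


# Hypothetical posting times for one service and one topic.
posted = [
    datetime(2017, 3, 19, 10, 0),
    datetime(2017, 3, 19, 23, 30),
    datetime(2017, 3, 21, 8, 15),
]
timeline = daily_timeline(posted, start=date(2017, 3, 19), days=3)
for day, count in timeline:
    print(day.isoformat(), count)
```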

Spatial Information. The location information makes it possible to understand the origin of the data based on social sensor locations. Many social information services (e.g., Twitter) provide the facility to geo-tag the data. However, many platforms do not provide such a facility. In that case, the location information is extracted through text parsing by checking for mentions of location names. For Facebook and YouTube, we use the Stanford NER (Named Entity Recognizer) for location name extraction. The Precision, Recall and F-Measure of Stanford NER with the baseline training dataset are 0.699, 0.682 and 0.691, respectively [11]. Although Stanford NER is a useful tool for location extraction, more sophisticated algorithms are required to determine the exact location names from text. For politics, all social information services have the highest amount of geo-location information. Meanwhile, for health, all services have the lowest. Table 4 presents the spatial information of Twitter based on embedded geo-coordinates, and of Facebook and YouTube based on Stanford NER.

Table 4. Signal to Noise Ratio (SNR) of relevant data
Table 5. Formal to informal data ratio

Text Type. We used a binary classifier to label the data into two categories: Formal and Informal. Our classifier uses a subset of an Internet slang dictionary. It contains trending acronyms and abbreviations (e.g., lol, IMHO, omg) that are used in social information services, chat rooms, blogs, and Internet forums. We find that Twitter has the highest ratio of informal data for all of the topics. Social sensors tend to use more slang, abbreviations and special characters on Twitter. Meanwhile, social sensors on Facebook and YouTube are more likely to post their data in the formal style. Table 5 presents the classification ratios by text type for the three social information services.
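
A dictionary based labeler of this kind can be sketched in a few lines. The decision rule used here (a text is Informal if it contains at least one slang token) is our assumption, since the exact rule is not spelled out above, and the three-entry dictionary merely reuses the examples from the text in place of the real slang dictionary subset.

```python
import re

# Placeholder dictionary built from the examples in the text.
SLANG = {"lol", "imho", "omg"}


def text_type(text, slang=SLANG):
    """Label a text 'Informal' if any token is in the slang dictionary."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return "Informal" if any(tok in slang for tok in tokens) else "Formal"


def formal_informal_counts(texts, slang=SLANG):
    """Return (formal_count, informal_count) for a collection of texts."""
    labels = [text_type(text, slang) for text in texts]
    return labels.count("Formal"), labels.count("Informal")


# Hypothetical sample comments.
sample = ["OMG that movie was great", "The film was well directed.", "lol"]
counts = formal_informal_counts(sample)
print(counts)
```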

Lexical Diversity. We analyze the data of social information services based on three measures: characters, words, and sentences, using three statistics: minimum, maximum, and average. Before processing our dataset, we first strip all “online-specific” markup (e.g., hashtags, user mentions, emoticons, special characters). Second, we accommodate out-of-vocabulary (OOV) words [1], rather than using a pure dictionary based technique which discards misspelled words (e.g., slang, abbreviations). In our analysis, we find that due to the 140-character limit, the lexical properties of Twitter are almost consistent. Facebook and YouTube have mostly similar results for ‘minimum’, where social sensors may post only a single character (e.g., ‘D’, ‘P’) or word (e.g., cool, awesome) in response to the content. However, due to their larger character limits, Facebook and YouTube show variations for the other two statistics, maximum and average, where social sensors may post a large number of characters. Table 6 presents the lexical analysis statistics of the three social information services by topic.
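
The three measures and three statistics can be sketched as below. The stripping rules and counting conventions are assumptions for illustration: hashtags and user mentions are removed as whole tokens, remaining special characters are replaced by spaces, characters are counted including single spaces, and sentences are split on '.', '!' and '?'.

```python
import re
from statistics import mean

MARKUP = re.compile(r"[#@]\w+")              # hashtags and user mentions
SPECIAL = re.compile(r"[^A-Za-z0-9\s.!?']")  # everything but text and sentence marks


def strip_markup(text):
    """Remove online-specific markup and collapse whitespace."""
    text = MARKUP.sub(" ", text)
    text = SPECIAL.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def measures(text):
    """Return character, word and sentence counts of the cleaned text."""
    clean = strip_markup(text)
    sentences = [s for s in re.split(r"[.!?]+", clean) if s.strip()]
    return {"characters": len(clean),
            "words": len(clean.split()),
            "sentences": len(sentences)}


def summarize(texts):
    """Minimum, maximum and average of each measure over non-empty input."""
    rows = [measures(text) for text in texts]
    return {name: {"min": min(r[name] for r in rows),
                   "max": max(r[name] for r in rows),
                   "avg": mean(r[name] for r in rows)}
            for name in ("characters", "words", "sentences")}


# Hypothetical sample comments.
sample = ["Loved it #BeautyAndTheBeast @disney :D", "Cool. Great movie!"]
stats = summarize(sample)
print(stats)
```

Note that under these rules an emoticon such as ":D" is reduced to the single letter "D", which is one way a one-character item can arise after cleaning.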

Meta-data Properties. Social information services have various types of meta-data properties. For simplicity, we divide meta-data properties into two groups: Reactions and Propagation index (P-Index). Reactions describe the feedback (e.g., likes, dislikes) that social sensors give in response to content. The Propagation index describes the popularity of the content among other social sensors (e.g., number of shares, views, downloads). We process a subset from Day 7 of our dataset to analyze the meta-data properties. We notice that, despite receiving fewer reactions, YouTube content is propagated more than Facebook and Twitter content. Table 7 summarizes the results of the three social information services by topic.
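
The two groups can be sketched as simple per-service aggregates. Summing the raw counts is our simplifying assumption; the field names, the grouping of fields and the sample values are hypothetical and are not results from the dataset.

```python
REACTION_FIELDS = ("likes", "dislikes")
PROPAGATION_FIELDS = ("shares", "views", "downloads")


def metadata_summary(items):
    """Sum reaction and propagation counts per service.

    items: iterable of dicts with a 'service' key and optional count fields;
    a missing count is treated as zero.
    """
    summary = {}
    for item in items:
        entry = summary.setdefault(item["service"],
                                   {"reactions": 0, "p_index": 0})
        entry["reactions"] += sum(item.get(f, 0) for f in REACTION_FIELDS)
        entry["p_index"] += sum(item.get(f, 0) for f in PROPAGATION_FIELDS)
    return summary


# Hypothetical items; the numbers are illustrative only.
items = [
    {"service": "YouTube", "likes": 10, "dislikes": 2, "views": 500},
    {"service": "Facebook", "likes": 40, "shares": 5},
    {"service": "YouTube", "likes": 3, "views": 120},
]
summary = metadata_summary(items)
print(summary)
```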

Table 6. Lexical analysis of social information services by topic
Table 7. Meta-data analysis

6 Discussion: Directions and Research Challenges

Social information service computing is an emerging area of the service oriented paradigm [2]. Service orientation provides the leverage to describe social media platforms with respect to various functional and QoS properties. Furthermore, the functional and QoS diversity helps in understanding the analysis needs of social sensor data. In addition, this diversity can be utilized by analysts (e.g., Bella) to effectively process and analyze various social information services. Despite these benefits, the service orientation of social media platforms faces several challenges. In the remainder of this section, we outline the research challenges associated with the service orientation of social media platforms.

6.1 Data Collection

With the rapid advances in the availability of cheap Internet across the globe, social information services have become a vital tool for online data generation and sharing. According to statista.com, by the end of 2018, 2.44 billion social sensors will be sharing their data online. By considering the enormous number of social sensors, a large amount of the ‘Big Social Data’ is expected to be generated. Current data collection methods such as platform provided APIs, scraping tools, etc., may not be able to cope with the increasing amount of data. One major challenge is to develop frameworks, infrastructures and data services to efficiently collect the required data from various social information services, and deliver to various applications.

6.2 Social Information Service Noise

Social information services contain valuable data shared by social sensors. However, extracting the required information from social sensor data is not an easy task. There are several factors that affect the quality of the data. For instance, human languages are ambiguous and contain various types of slang, abbreviations, emotions, irony, etc. Moreover, many social information services impose certain limitations (e.g., text length) on data sharing. Consequently, social information service data contains various types of noise (e.g., spam, irrelevant data). One key challenge is to develop multiple noise filtering methodologies for different social information services.

6.3 Social Information Service Analysis

Social information services are heterogeneous in nature. These services have different application usages such as image sharing, video streaming, personal information sharing, etc. As a result, their data has varying features (e.g., lexical, text type, size). Consequently, a single information analysis tool (e.g., sentiment analysis) may not be able to effectively and efficiently analyze the data obtained from different social information services. Hence, specialized information analysis techniques such as machine learning algorithms, lexical corpora and text mining tools are required to analyze the data based on these dynamic features. Furthermore, the meta-data generated by social information services contains useful descriptive information. It can be used to determine various social sensor patterns, sentiment, etc. Moreover, it can be utilized to complement an analysis.

6.4 Social Sensors Diversity

Social information services are accessed by social sensors across the globe. These social sensors have various demographic features such as age, gender, race, religion, political affiliations, etc. The demographic features of social sensors may influence the behaviors and usage patterns of social information services. Moreover, they may also affect the QoS properties of the social information services. Thus, one important aspect is to understand the social sensor diversity and analyze its effects on social information services.

6.5 Social Information Service Manipulation

In the service oriented paradigm, a key challenge is automated service manipulation (e.g., service discovery, service ranking, service selection and composition) by end users for different application scenarios. Although we proposed a formal model to represent social information services in this paper, comprehensive frameworks such as the Web Ontology Language for Services (OWL-S), the Resource Description Framework (RDF), etc., need to be adopted for the automated manipulation of social information services.

7 Conclusion and Future Work

In this paper, we argued that social media platforms are heterogeneous and that different topics affect their data features. Thus, in order to effectively process and analyze these platforms, it is vital to understand their diverse quality features. To support our argument, we used the notion of service orientation to model and analyze social media platforms as services, i.e., social information services. We devised a formal service model to represent social information services. We classified social information services based on their functional features. We proposed a novel QoS model to capture the non-functional properties of social information services. We collected the data from three social information services: Facebook, YouTube, Twitter, for three different topics: Politics, Entertainment, Health. We conducted experiments and quantified the results based on functional and QoS properties. The experimental results support our hypothesis.

Limitations and Future Work. Currently, we have analyzed the data from only three social information services, whereas the cyberspace of social information services is much broader. Moreover, our analysis does not include demographic features (e.g., age, race) of social sensors as QoS. In future, we aim to extend the QoS model by including the demographic features of social sensors. Finally, we plan to develop a composition framework that will use the QoS properties to dynamically compose services for social information service analysis applications.