1 Introduction

In online community question answering (CQA) sites, active participation is essential to keeping the network active [7]. Stack Overflow, a popular CQAFootnote 1, is an online social networking site where users can ask specific computer programming questions and other users can post answers to those questions. Because quality posts are essential in keeping the network active [7], people who post questions but do not receive quality answers could be discouraged from returning to the site. Similarly, questions that are poorly written or of low quality could discourage the people who often post answers from doing so. Thus, it is important for the network’s users to continuously post quality content. To influence users to do so, Stack Overflow has a voting system through which community members can vote on questions and answers depending on the perceived quality of the posts. According to Stack OverflowFootnote 2, “Voting up a question or answer signals to the rest of the community that a post is interesting, well-researched, and useful, while voting down a post signals the opposite: that the post contains wrong information, is poorly researched, or fails to communicate information. The more that people vote on a post, the more certain future visitors can be of the quality of information contained within that post, not to mention that upvotes are a great way to thank the author of a good post for the time and effort put into writing it!”. Voting is central to the model of Stack Overflow because it is one of the ways users who consistently post useful content are rewarded with reputation points, privileges, and badges. Reputation shows how much the community trusts a user, and it can be earned by posting quality questions and answersFootnote 3. The more reputation points a user earns, the more privileges they have in the network.
For example, a user who has earned at least 10,000 reputation points gains the “access to moderator tools” privilege, which gives them access to reports exclusive to site moderators and the ability to delete questionsFootnote 4. Badges are awarded to users who are especially helpful in the communityFootnote 5. Some badges are awarded based on the votes users earn on their question and answer posts. For example, the “nice question” badge is awarded to users whose question post has earned a score of 10. Scores are computed from the number of upvotes and downvotes a post has earned.

To encourage users to write quality posts (that can earn upvotes), Stack Overflow provides guidelines on how to write good questionsFootnote 6 and answersFootnote 7. For example, Stack Overflow recommends proofreading before posting a question. Despite these guidelines and the rewards in place for quality posts, there are still several posts on Stack Overflow with negative scores, meaning those posts received more downvotes than upvotes. Negative votes could be discouraging to the users who make such posts. Thus, it is important to explore the characteristics of posts with negative scores to identify what makes a post receive negative votes. This could inform users of why their posts receive negative votes and what they can do to increase the chances of receiving upvotes on their posts.

The quality of posts in CQAs is an ongoing research area. Bazelli et al. [12] explored the relationship between the personality traits of users and the votes received by the users’ posts, to determine if people with particular personality traits received only certain types of votes. Gantayat et al. [19] explored the difference between the accepted answer to a question and the answer with the highest number of upvotes, to determine if the post chosen by the asker as the answer to their question was also the post with the highest number of upvotes from the community. Yao et al. [39] studied high-quality posts in Stack Overflow using the votes received by the posts, and developed an algorithm to identify high-quality posts soon after they appear on Stack Overflow. Although these researchers explored the quality of posts in the community, they did not consider the emotions or sentiments expressed in the posts, or other features such as the length of the post and the time of day it was created. To fill that gap, we investigate in this study the differences in (1) sentiments and emotions, (2) length of posts, (3) creation time of posts, and (4) the subject areas posts were tagged with, between posts with positive scores and those with negative scores.

The sentiments and emotions expressed by users in their posts in online communities have been shown to influence the popularity of such posts [18, 23]. We thus hypothesize that the emotions and sentiments expressed by users in their posts could have an effect on the type of votes received by such posts. In addition, research has shown that the length of a post and the time it was created in CQAs influence people’s attitude towards the post [11, 24]. We, therefore, hypothesize that the length of posts and time when posts were created in Stack Overflow could influence the votes the posts receive in the community. Furthermore, the subject areas questions are tagged with have been shown to influence the response time of answers posted in response to the questions [14]. We further hypothesize that the subject areas posts are tagged with could influence the scores received by the posts.

To validate our hypotheses, we compared posts that earned high negative scores to those that earned high positive scores and explored the differences between the two groups in the timing of the posts, the emotions and sentiments used in writing them, their length, and their tags/subject areas. We analyzed both groups of posts to determine if there were any significant differences in these factors between them. In addition, we developed a predictive model that uses these factors to predict whether a post will likely receive a positive or negative score.

The results of our analyses suggest that posts created at night received significantly higher scores than posts written at other times of the day. In addition, posts with positive scores were significantly longer than posts with negative scores. Furthermore, posts with positive scores scored significantly higher on the analytic and clout dimensions and lower on authenticity compared to posts with negative scores. To predict whether a post will receive a positive or negative score, we developed and tested a predictive model using the word count of posts, their sentiment and the time of day they were created. Using the random forest machine learning algorithm, our model achieved a classification accuracy of 73%.

Our study contributes to the domain of CQAs in several ways. First, the results presented here explain how posts with positive scores differ from those with negative scores and which features can differentiate the two. Second, our study shows that the emotions and sentiments expressed by users in their posts should be considered when explaining the type of votes a post receives. Finally, our predictive model suggests that sentiments and emotions, in addition to the length of a post and the time of day it was created, can predict the score of posts with high accuracy.

2 Related Work

2.1 Stack Overflow

Stack OverflowFootnote 8 is a CQA platform where users can ask and answer specific IT-related questions. Authors of questions can earn reputation and rewards when their posts get upvoted. Stack Overflow currently has over 5 million users and over 11 million questions. Stack Overflow is an active research area for CQAs, with researchers studying the effect of rewards on the community and exploring the voting patterns of users. While the former has received a lot of attention in recent times [4,5,6,7,8,9,10], there is still room for more research on the latter.

Research on voting patterns could take the form of understanding what makes a good post in the community. For example, Nasehi et al. [29] explored the features that make a good code example in Stack Overflow. In Stack Overflow, questions are often accompanied by code examples when the asker needs help with their code. Nasehi et al. concluded that the quality of the code example accompanying a question is as important as the question itself. By analyzing posts with high scores, the authors determined the characteristics of good code examples, which could explain why posts received upvotes. Similarly, Asaduzzaman et al. [11] investigated why some questions in the community go unanswered. They concluded that questions that go unanswered failed to attract experts, were too short, were sometimes duplicates, and were often too hard or time-consuming. Bazelli et al. [13] also investigated voting patterns in the community by exploring the personality of users on Stack Overflow based on their reputation. Their results suggest that users with high reputation are more extroverted than users with low reputation. They also concluded that the authors of posts with positive votes showed fewer negative emotions than authors of posts with negative votes. Our research differs from theirs because, in addition to the emotions of the users, we also explored their sentiments. Furthermore, we developed a predictive model using sentiment and emotion in addition to the length of the post and the time of day it was made.

2.2 LIWC

In this paper, we identify the sentiments and emotions of users in Stack Overflow using the Linguistic Inquiry and Word Count (LIWC) tool [33]. The LIWC tool reads text and determines what percentage of its words reflect various dimensions of the writer’s sentiments and emotions. It works by calculating the percentage of words in the text that match its built-in dictionary of over 6,400 words for different dimensions of sentiment and emotion. These dimensions include (1) analytic, which refers to how analytical a user’s text is, (2) clout, which refers to the social status, confidence and leadership displayed in the text by the author, (3) authentic, which shows how authentic the author is, and (4) tone, which describes the emotional tone of the author.

LIWC has been used extensively and successfully to identify the personality of users in online social communities. Bazelli et al. [13] used the LIWC tool to explore the personality traits of users in Stack Overflow. Their research suggests that top contributors in the community are extroverts. Romero et al. [36] also used the LIWC tool in their study of social networks; the authors explored how the personality traits and behavior of decision makers in a large hedge fund change in response to price shocks. Adaji et al. [3] developed a personality-based recipe recommendation system for a popular recipe social site, allrecipes.com, using the LIWC tool to identify the personality of users. In investigating low review ratings in Yelp based on the personality of users, Adaji et al. [2] likewise used the LIWC tool to identify users’ personality.

The LIWC tool has also been used in natural language processing to analyze the sentiments and emotions of users from their text. Kacewicz et al. [22] investigated the use of pronouns to reflect standing in social hierarchies by analyzing users’ text with the LIWC tool. Their research suggests that people with higher status use fewer first-person singular pronouns such as “I”, and more first-person plural pronouns such as “we” and second-person singular pronouns such as “you”. Newman et al. [30] also used the LIWC tool to explore the linguistic style of writers in order to identify deception. Their research suggests that people who are deceptive show lower cognitive complexity and use fewer self-references.

Based on the popularity and success of the LIWC tool as reported by other researchers, we chose to use it in this research in identifying the sentiments and emotions displayed by users in their posts.

3 Methodology

The aim of this study is to investigate the difference in (1) sentiments and emotions, (2) length of posts, (3) creation time of posts, and (4) the subject area posts were tagged with between posts with positive scores and those with negative scores. In this section, we describe the methodology used in the study.

3.1 Data Collection

To carry out this study, we used data from Stack Overflow’s data explorerFootnote 9, which allows one to directly query Stack Overflow’s publicly available dataset. We extracted question and answer posts created within the last year with a score of −5 or less, which we termed negative_posts, and posts with a score of at least 20, which we called positive_posts. These thresholds were chosen based on the average score of posts on Stack Overflow. In total, we had 10,005 posts with negative scores and 4,714 posts with positive scores.
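As an illustration of this grouping step, the following sketch partitions posts (e.g. exported from the Data Explorer as rows with a score field) into the two groups. The field names and default thresholds here are our own illustrative assumptions, not the authors' exact query.

```python
# Hypothetical sketch: partition exported posts into the two study groups.
# Posts whose score falls between the two thresholds are excluded.

def label_post(score, negative_max=-5, positive_min=20):
    """Return the group a post belongs to, or None if it is excluded."""
    if score <= negative_max:
        return "negative_posts"
    if score >= positive_min:
        return "positive_posts"
    return None

posts = [{"id": 1, "score": -7}, {"id": 2, "score": 35}, {"id": 3, "score": 4}]
groups = {p["id"]: label_post(p["score"]) for p in posts}
```

A real extraction would apply the same labeling inside the SQL query on the Data Explorer rather than in post-processing.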

3.2 Predictive Model

To predict if a post will get a positive score or a negative score, we developed a predictive model using the following features:

  • word count: the length of the post

  • sentiments and emotions identified from post

  • time of day the post was made: morning, afternoon or night.
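The two non-LIWC features above can be computed directly from a post's body and creation timestamp. This sketch uses the time-of-day boundaries defined later in Section 4.1 (morning 12:01 am–12:00 pm, afternoon 12:01 pm–6:00 pm, night 6:01 pm–12:00 am); the function names and example post are our own illustrative assumptions.

```python
from datetime import datetime

def time_bucket(created):
    """Map a post's creation timestamp to morning, afternoon or night."""
    minutes = created.hour * 60 + created.minute
    if minutes == 0:          # 12:00 am closes the night bucket
        return "night"
    if minutes <= 12 * 60:    # 12:01 am - 12:00 pm
        return "morning"
    if minutes <= 18 * 60:    # 12:01 pm - 6:00 pm
        return "afternoon"
    return "night"            # 6:01 pm - 12:00 am

def word_count(body):
    """Length of the post, measured in whitespace-separated words."""
    return len(body.split())

post = {"body": "How do I reverse a list in Python?",
        "created": datetime(2018, 3, 1, 20, 15)}
features = {"word_count": word_count(post["body"]),
            "time_of_day": time_bucket(post["created"])}
```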

We applied various classification algorithms to predict if a post will receive a positive or negative final score. These include logistic regression, random forest, k-nearest neighbor, Naïve Bayes, support vector machine, and neural networks. These algorithms were selected based on what other researchers have used in the past [7, 11, 39].

In developing and testing our model, we used Python’s Scikit-learn machine learning moduleFootnote 10. We chose this module because of its ease of use, performance, documentation and API consistency [32]. In addition, several researchers have successfully used it for supervised and unsupervised learning [17].

The following section briefly describes the classifiers we used and how they were implemented using Python’s Scikit-learn machine learning module.

Logistic Regression

Logistic regression is a classification model used to predict the outcome of a dependent variable using one or more predictors. It can be used when a model has one nominal dependent variable and two or more measurement variables. Logistic regression has been used extensively in predictive analysis [27, 28, 38] and research has shown that it is effective in producing quick and robust results [28]. Logistic regression was implemented using the LogisticRegression class of Scikit-learn.

Random Forests

Random forests is a well-researched classification algorithm that is known to be robust against over-fitting [26]. At each split, the algorithm considers only a random subset of the predictors [25], which is useful when the data set contains one very strong predictor alongside several moderately strong ones: because splits do not always consider the strong predictor, the other predictors have more of a chance to contribute. Random forests was implemented using the RandomForestClassifier class of Scikit-learn.

K-nearest Neighbor

K-nearest neighbor classifies data based on a “majority vote” of the nearest neighbors of each point. K-nearest neighbor is non-parametric, thus the structure of the model is determined from the data. K-nearest neighbor is known to be highly intuitive with low classification errors [15]. K-nearest neighbor was implemented using KNeighborsClassifier class of Scikit-learn with 5 neighbors.

Naïve Bayes

Naïve Bayes is an efficient and effective learning algorithm that uses Bayes theorem. Bayes theorem describes the probability of an event based on previous information about the event; the probability is revised given new or additional knowledge. Naïve Bayes assumes strong independence between predictors [1]. Naïve Bayes was implemented using GaussianNB class of Scikit-learn.

Support Vector Machine

Support vector machine represents variables as points in space, mapped in such a way that the variables are divided into classes by a clear gap that is as wide as possible; new variables are easily assigned to a class based on which side of the gap they fall [21]. Support vector machine was implemented using the SVC class of Scikit-learn.

Neural Networks

Neural networks are nonlinear regression techniques inspired by theories about the central nervous system of animals, in particular the human brain. In this algorithm, hidden variables model the outcome of the network. These hidden variables are a linear combination of all or some of the original predictors, and a linear combination of the hidden variables forms the output or prediction [25]. Neural networks were implemented using the MLPClassifier class of Scikit-learn with five hidden layers.
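The six classifiers above can be set up with Scikit-learn as sketched below. Only the settings stated in the text (5 neighbors for k-NN, five hidden layers for the neural network) come from this study; the layer width, the toy feature matrix and all other values are our own illustrative assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

classifiers = {
    "logistic regression": LogisticRegression(),
    "random forest": RandomForestClassifier(),
    "k-nearest neighbor": KNeighborsClassifier(n_neighbors=5),
    "naive Bayes": GaussianNB(),
    "support vector machine": SVC(),
    "neural network": MLPClassifier(hidden_layer_sizes=(10,) * 5, max_iter=500),
}

# Toy feature rows: [word_count, analytic, clout, authentic]; label 1 marks a
# positive-score post, 0 a negative-score post. The numbers are made up.
X = [[250, 80, 60, 20], [300, 85, 70, 15], [240, 78, 65, 25],
     [100, 40, 30, 70], [90, 35, 25, 80], [120, 45, 35, 60]]
y = [1, 1, 1, 0, 0, 0]

# Fit each classifier and record its training accuracy.
accuracy = {name: clf.fit(X, y).score(X, y) for name, clf in classifiers.items()}
```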

3.3 Evaluating the Algorithms

The algorithms were evaluated by computing their classification accuracy, precision, recall and F-score. Classification accuracy is the number of correct classifications divided by the total number of classifications. Precision is the fraction of items retrieved by the algorithm that are relevant, while recall is the fraction of all relevant items that were retrieved by the algorithm. The F-score is the harmonic mean of precision and recall [16]. The closer these metrics are to 1, the better the performance of the algorithm; the closer they are to 0, the worse.
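For a binary problem like ours, all four metrics follow from the confusion-matrix counts. A minimal sketch (the counts below are made-up numbers, not results from this study):

```python
# Evaluation metrics from a binary confusion matrix: tp/fp/tn/fn are the
# true-positive, false-positive, true-negative and false-negative counts.

def evaluate(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F-score: the harmonic mean of precision and recall
    f_score = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}

metrics = evaluate(tp=80, fp=20, tn=60, fn=40)
```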

4 Analyses and Results

We used four main factors in comparing posts with negative scores to those with positive scores: (1) the creation time of posts, (2) the sentiments and emotions used in writing posts, (3) the length of the posts, and (4) tags/subject areas of the posts.

4.1 Creation Time of Posts and Scores

A two-way ANOVA was conducted to determine if the average score received by posts (1) differed based on the time of day the post was created and (2) differed significantly between the two groups of posts: negative_posts and positive_posts. The time of day was classified into three periods: morning, between 12:01 am and 12:00 pm (n = 2205); afternoon, between 12:01 pm and 6:00 pm (n = 1418); and night, between 6:01 pm and 12:00 am (n = 992). There were no outliers, as assessed by boxplot; data were normally distributed for each group, as assessed by the Shapiro-Wilk test (p > .05); and there was homogeneity of variances, as assessed by Levene’s test of homogeneity of variances (p = .120). Data are presented as mean ± standard deviation.

There was a statistically significant interaction between the group of post (negative_posts and positive_posts) and the time of day the posts were made, F(2,14713) = 3.117, p = 0.044, partial η2 = .01. Therefore, an analysis of simple main effects for negative_posts and positive_posts was carried out. There was a statistically significant difference in the mean score of positive_posts between the different times of day (morning, afternoon and night), F(2,14713) = 4.662, p = .009, partial η2 = .001. As shown in Fig. 1, the average score for positive posts decreased from morning (57.44 ± 353.49) to afternoon (49.50 ± 258.92), and increased significantly at night (73.21 ± 400.49). This suggests that posts created at night receive significantly higher votes than those created in the morning or afternoon. Thus, users who receive negative scores on their posts could consider creating their posts at night instead of during the day.

Fig. 1.
figure 1

Average upvotes posts received based on the time of the day they were created

4.2 Sentiments and Emotions

To understand the sentiments and emotions shown in posts, we carried out sentiment analysis of the posts in our dataset using the Linguistic Inquiry and Word Count (LIWC) tool [37]. We chose this tool because it has been used extensively in research for analyzing user-generated content in online systems [2, 3, 13]. The results displayed in Fig. 2 show the average sentiment and emotion scores of posts with negative scores and posts with positive scores for the different dimensions of sentiment and emotion. Our results suggest that posts with positive scores have a higher mean score for analytic, clout, and tone. Several emotions and sentiments, such as affect and positive and negative emotion, had very low scores. We attribute this to the type of social network Stack Overflow is: compared to a social networking site such as Facebook, which is meant for building friendships, Stack Overflow is mainly for learning, so its posts are likely to be more analytical and less emotional. We therefore excluded the sentiments and emotions with low mean scores and continued our analysis using only analytic, clout, authentic and tone because of their high mean scores.

Fig. 2.
figure 2

Sentiments identified by LIWC tool

According to the LIWC toolFootnote 11, the dimension analytic represents the extent to which people use formal words, and how logical and hierarchical their thinking patterns are. The dimension clout is an indication of the social status, confidence or leadership displayed by an individual through their writing or speaking. Authenticity, according to the LIWC tool, reveals how honest people are through their writing. People high in authenticity are more personal, humble and vulnerable. Tone indicates the emotions displayed by users in their posts.

To determine if there were any statistically significant differences in the mean sentiments between the posts with positive scores and those with negative scores, we carried out a two-way mixed ANOVA with sentiments: analytic, clout, authentic and tone as our within-subject factors and the type of score received by posts: negative_posts and positive_posts as our between-subject factors.

Mauchly’s test of sphericity indicated that the assumption of sphericity was not met for the two-way interaction, χ2(2) = 4979.784, p < .01; therefore, degrees of freedom were corrected using Huynh-Feldt estimates of sphericity (ε = 0.86), as suggested by [20], since ε is greater than 0.75. There was a statistically significant interaction between the sentiments and the type of posts (negative_posts or positive_posts), F(2.579, 37958) = 494.056, p < .0005, partial η2 = 0.32, ε = 0.86. To determine where the differences were, we tested for simple main effects to identify any differences between the two groups of posts for each sentiment.

There were statistically significant differences between the types of posts (negative_posts and positive_posts) for the analytic, clout and authentic dimensions. The difference in analytic scores between the types of posts was significant at F(1,14716) = 400.58, p < 0.001, partial η2 = 0.26. Similarly, the difference in clout scores was significant at F(1,14716) = 1296.54, p < 0.001, partial η2 = 0.81, and the difference in authentic scores was significant at F(1,14716) = 536.21, p < 0.001, partial η2 = 0.35. This suggests that posts with positive scores scored higher on the analytic and clout dimensions than posts with negative scores. In addition, posts with negative scores were more authentic than posts with positive scores. There was no significant difference in tone between the negative_posts and positive_posts.

4.3 Length of Posts and Tags

Because the assumption of homogeneity of variances was violated, as assessed by Levene’s test for equality of variances (p < .001), a Welch t-test was run to determine if there were differences in the length of posts between positive and negative posts. Data are mean ± standard deviation, unless otherwise stated. There were no outliers in the data and both types of posts were normally distributed, as assessed by Shapiro-Wilk’s test (p > .05). The positive posts were significantly longer (249.27 ± 344.739) than the negative posts (152 ± 208.33), a statistically significant difference of 97.26 (95% CI, 86.61 to 107.92), t(6383.718) = 17.892, p < .001.
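The Welch t-test used above does not assume equal variances; a minimal pure-Python sketch of the statistic and the Welch-Satterthwaite degrees of freedom follows. The two samples of word counts are made-up numbers, not data from this study.

```python
import math

def welch_t(sample_a, sample_b):
    """Welch's t statistic and degrees of freedom for two samples."""
    na, nb = len(sample_a), len(sample_b)
    ma = sum(sample_a) / na
    mb = sum(sample_b) / nb
    # Unbiased sample variances
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)
    se = math.sqrt(va / na + vb / nb)
    t = (ma - mb) / se
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va / na + vb / nb) ** 2 / (
        (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

positive_lengths = [250, 310, 220, 280, 260]
negative_lengths = [140, 150, 160, 130, 170]
t_stat, dof = welch_t(positive_lengths, negative_lengths)
```

With real data one would typically call `scipy.stats.ttest_ind(a, b, equal_var=False)` instead.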

We also compared the tags used in the posts with positive votes and those used in posts with negative votes to determine if both groups of posts had different tags. Figure 3 shows that there are similarities in the popularity of tags for both groups of posts; for example, both categories include several posts about Python, Java, JavaScript and Android. Both groups have similar top 10 tags. This suggests that popularity of tags was similar for both types of posts.

Fig. 3.
figure 3

Word cloud of tags used in posts with negative scores and positive scores

4.4 Predictive Model

We used the predictors word count, sentiment (analytic, clout, and authentic) and time of day (morning, afternoon and night) to predict whether a post will have a positive or negative score. We excluded tags because there was no difference between the negative_posts and positive_posts groups based on tags. Our data set was randomly split into a 75% training set and a 25% test set. We tested our model using six algorithms: logistic regression, random forest, k-nearest neighbor, Naïve Bayes, support vector machine, and neural networks. The results of these algorithms are presented in Table 1.
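The evaluation set-up above (a random 75/25 split, then fit and score) can be sketched end-to-end with the best-performing classifier. The synthetic feature rows below stand in for the real word-count and sentiment features and are our own assumption, as are all parameter values other than the 75/25 split.

```python
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

random.seed(0)
# Synthetic rows: [word_count, analytic, clout, authentic]; label 1 for a
# positive-score post, 0 for a negative-score post. Values are made up.
X, y = [], []
for _ in range(200):
    if random.random() < 0.5:
        X.append([random.gauss(250, 40), random.gauss(80, 5),
                  random.gauss(65, 5), random.gauss(20, 5)])
        y.append(1)
    else:
        X.append([random.gauss(150, 40), random.gauss(60, 5),
                  random.gauss(45, 5), random.gauss(40, 5)])
        y.append(0)

# 75% training / 25% test split, as in the study
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
```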

Table 1. Classification accuracy, precision and recall of the classification algorithms

Of the six algorithms used, random forest had the highest classification accuracy, precision and F-score.

4.5 Discussion

To better understand why some posts receive negative scores and others receive positive scores, we set out to investigate differences between posts with positive scores and those with negative scores based on the following criteria: (1) the time of day the post was created, (2) the sentiments and emotions used in writing the post, (3) the length of the post, and (4) the topics the post was tagged with. We further developed and tested a model to predict whether a post will receive a positive or negative score using these criteria and six machine learning algorithms.

Our results show significant differences in the scores received by posts based on the time of day the posts were created. Our results suggest that posts with positive scores created between 6:01 pm and 12:00 am had significantly higher scores on average than posts created at other times of the day. Thus, in order to avoid negative scores on their posts and possibly receive higher positive scores, authors could consider posting questions and answers between 6:01 pm and 12:00 am.

According to the LIWC toolFootnote 12, the dimension analytic represents the extent to which people use formal words, and how logical and hierarchical their thinking patterns are. People low in analytical thinking typically write in more narrative ways, use less formal logic and rely on knowledge gained from personal experience [31]. On the other hand, people high in analytical thinking use formal logic, are more detailed in their explanations and avoid contradiction [31]. Our results show that the posts with negative scores are less analytical than those with positive scores. This suggests that posts with negative scores are written with less formal words, little logic and fewer explanations. Thus, in order to avoid low scores, authors should write their questions and answers in Stack Overflow with more logic, more detailed explanations and more formal technical words.

The dimension clout, as defined by the LIWC tool, is an indication of the social status, confidence or leadership displayed by an individual through their writing or speaking. People with higher clout typically use more first-person plural pronouns (such as “we”) and second-person singular pronouns (such as “you”), and fewer first-person singular pronouns (such as “I”) [22]. People in this category tend to focus their attention outwards, towards the people they are interacting with. On the other hand, people low in clout are more self-focused and use more first-person singular pronouns (such as “I”) [22]. Our results show that posts with positive scores are written by people with high clout. This suggests that posts with positive scores are written in a way that focuses attention outwards rather than inwards on the writer. Such posts also use fewer first-person singular pronouns and more first-person plural (such as “we”) and second-person singular pronouns (such as “you”). We therefore suggest that authors who post questions and answers in Stack Overflow should use fewer first-person singular pronouns.

People who are high in the dimension authenticity, according to the LIWC tool, reveal themselves to others (through their writing) in a more honest way. Such people are more personal, humble and vulnerable. On the other hand, people who are lower in authenticity use words that show lower cognitive complexity and more negative emotion words [30]. Our results show that posts with positive scores display less authenticity compared to posts with negative scores. Thus, in order to avoid negative scores, authors should post questions and answers using words that show less of their personal side and their vulnerability. These could include words with higher cognitive complexity and positive emotions [30].

There was a significant difference in the average length of posts between posts with positive scores and those with negative scores; the former had an average word count of 249 while the latter had an average word count of 152. This suggests that users who create posts with negative scores could improve their posts by writing more words that better explain their questions/answers. This is in line with other researchers who suggest that the length of a post is an influencing factor of the score the post will receive [11, 35].

We developed a predictive model using word count, the sentiment and emotion dimensions analytic, clout, authentic and tone, and time of day to predict whether a post will receive a positive or negative score. We tested our model using six classification algorithms: logistic regression, random forest, k-nearest neighbor, Naïve Bayes, support vector machine, and neural networks, and evaluated them using classification accuracy, precision, recall and F-score. Random forest performed best, with a classification accuracy of 73%, a precision of 77% and a recall of 87%. Changing the number of trees from 5 to 100 had no effect on the classification accuracy. This result is in line with previous research on prediction in Stack Overflow [7, 11]. Our results suggest that users could predict whether their posts will receive a positive or negative score from the length of the post, the sentiment and emotion dimensions analytic, clout, authentic and tone, and the time of day the post was created, using the random forest algorithm.

5 Conclusion and Future Work

Voting is central to the model of providing quality posts in Stack Overflow. According to Stack Overflow, an upvote indicates that a post is interesting, well researched and useful, while a downvote indicates the opposite. People can earn rewards such as reputation, privileges and badges when their posts get upvoted. To encourage people to write good posts (which could lead to upvotes), Stack Overflow has a “How to ask” guide with suggestions on how to write good question posts and a similar one, “How do I write a good answer”, with suggestions on how to write good answer posts. Despite these measures by Stack Overflow to ensure the quality of posts, there are still several posts with negative scores. A post with a negative score received more downvotes than upvotes. Negative scores could have a negative influence on users’ participation; if users keep getting negative scores on their posts, they could be discouraged from using the network. To better understand the scores received by posts, we investigated differences in some features of posts with positive scores and posts with negative scores. In particular, we compared the time of day the post was created, the sentiments and emotions used in writing the posts, the length of the posts, and the topics the posts were tagged with. The results of our analysis suggest that posts created between 6:01 pm and 12:00 am received significantly higher scores than posts written at other times of the day. In addition, posts with positive scores were significantly longer than posts with negative scores. Furthermore, posts with positive scores scored significantly higher on the analytic and clout dimensions and lower on authenticity compared to posts with negative scores. To predict whether a post will receive a positive or negative score, we developed and tested a predictive model using the length of the post, its sentiment and the time of day it was created. Using the random forest machine learning algorithm, our model achieved a classification accuracy of 73%.

Our research is limited in a few ways. First, the sentiment and emotions expressed in posts were identified using the LIWC tool. Thus, the accuracy of the sentiments identified depends on the accuracy of the tool. The tool has been used widely by researchers who attest to its accuracy [34, 37]. Second, the number of posts we used in this study (over 14,000) only represents a small fraction of the number of all the posts in Stack Overflow. We plan to re-run this study on a larger scale.

In the future, we plan to re-run this study on a larger scale using different datasets for training and testing. In addition, we plan to explore whether the number of negative votes people receive influences their participation in the network, for example, whether their participation correlates with the number of negative votes over time.