1 Introduction

Crowdsourcing [11] is an emerging Internet-based service pattern that solves challenging problems through the collective intelligence of the masses. First, a challenge is publicized, and the masses join the collaborative problem-solving process on their own initiative. Each participant evaluates the current state of the challenge and takes a specific action on it according to his or her knowledge and skills; as a result, the state of the challenge is updated. This iteration continues until the challenge is solved or is proved to be unsolvable. This is a typical social collaboration process.

Open Source Software (OSS) development is a representative form of service crowdsourcing [4]. Taking a “bug fix” in OSS development as an example, a bug reported by a user is a challenge, and distributed developers in the team reproduce the bug, analyze its root cause, locate the source code that might cause it, specify a bug-fix solution, allocate a developer to fix it, and then submit the bug-fix code for evaluation and confirmation. Although all bug-fix processes roughly follow the above steps, their process structures differ from bug to bug.

In the traditional service choreography scenario [6], a collaboration process is defined as a set of correlated activities with a pre-designed workflow and pre-specified responsibilities for each pre-specified role. However, in service crowdsourcing, collaboration processes cannot be planned in advance. This is because different participants have different levels of knowledge/skills and different preferences/interests; therefore, their behaviors are determined by their own initiatives and the latest status of the challenge. To sum up, social collaboration processes in crowdsourcing are highly stochastic and dynamic.

Nevertheless, there should be some features that delineate the distinct characteristics of each crowdsourcing process, for example, the number of steps in a social collaboration, the total duration, the number of participants, and the time intervals between neighboring steps. With these features, we can summarize commonalities and diversities among the collaboration processes of different crowdsourcing challenges and gain a deeper understanding of the inherent laws of problem solving based on collective intelligence.

In an OSS project, bugs keep emerging over time, and there is a core team that repeatedly participates in the bug-fix processes. We conjecture that, as time goes on, frequently occurring social collaboration habits might gradually take shape, i.e., similar or repeated collaboration patterns might appear. This leads to RQ1: are there any frequently occurring social collaboration patterns in service crowdsourcing? For this RQ, we propose two types of social collaboration patterns (CP): Participant-oriented Pattern (PP) and Role-oriented Pattern (RP). An extended Generalized Sequential Pattern (GSP) algorithm is put forward to identify PPs and RPs. Statistical analysis is conducted on the characteristics of the identified patterns.

RQ2 of this paper is to validate whether there are significant commonalities and diversities among different collaboration processes in terms of social collaboration features (CFs) and patterns. For this RQ, we cluster bug-fix processes w.r.t. their CFs and summarize the commonality inside each cluster and the diversities among different clusters. We then measure the degree of individualization of collaboration patterns in each long-standing crowdsourcing team (i.e., an OSS project). The results validate our conjecture, and implicit commonalities and diversities are preliminarily identified.

This study is based on GitHub. We collect social collaboration data for bug fixes (“issues”) in 10 selected OSS projects and conduct an empirical study. Answering the above RQs would help crowdsourcing coordinators gain a clear understanding of the collaboration habits of their participants, thus facilitating better task allocation and collaboration prediction. Besides, it extends traditional service choreography research on stable or fixed service processes to the stochastic and dynamic social collaboration processes that exist widely in more and more Internet-based service crowdsourcing scenarios.

2 Social Collaboration in Service Crowdsourcing

2.1 Challenges and “Actions” in Service Crowdsourcing

Problems to be solved in crowdsourcing are defined as challenges. For example, in OSS development each to-be-fixed bug is regarded as a challenge. In GitHub, it is called an issue. After an issue is reported, a social collaboration process is initiated to solve it.

An issue is described by attributes such as Created time, Proposer, Closer, Labels/tags, Milestone, Assignee, and Status. To solve an issue in the crowdsourcing way, GitHub offers a set of issue-related actions that can be taken by any participant with the required permissions: commit, createIssue, comment, closeIssue, reopen, addLabel, deleteLabel, addMilestone, deleteMilestone, reference, and assign.

2.2 Collaboration Features (CF)

1. Collaboration Duration (CD): time interval from the date when an issue is reported to the date when it is closed (or current date if it is now still open);
2. Collaboration Steps (CS): total number of actions that participants take to fix the issue;
3. Max Interval (MaxI): the maximum time interval between two neighboring actions during the collaboration;
4. Min Interval (MinI): the minimum time interval between two neighboring actions during the collaboration;
5. Median Interval (MidI): the median time interval between two neighboring actions during the collaboration;
6. Number of distinct Participants (NP): how many participants are there in the collaboration to fix the issue;
7. Number of Comments (NC): how many comments are included in the actions of the collaboration.
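As an illustration, the seven CFs can be computed from the time-stamped actions of one issue roughly as follows. This is a minimal sketch, not the implementation used in the study: the `Action` record, the function name, the choice of seconds as the time unit, the rule for deciding whether an issue is still open, and the decision to count every action (including createIssue) in CS are all assumptions made for the example, and the sample data are made up.

```python
from collections import namedtuple
from datetime import datetime, timedelta
from statistics import median

# Hypothetical record for one action: who did it, which action type, and when.
Action = namedtuple("Action", "actor kind time")


def collaboration_features(actions, now):
    """Compute the 7 CFs for one issue (durations and intervals in seconds)."""
    acts = sorted(actions, key=lambda a: a.time)
    # Assumption: the issue counts as closed only if its latest close/reopen
    # action is a close; otherwise CD runs until `now`.
    status = [a for a in acts if a.kind in ("closeIssue", "reopen")]
    closed = bool(status) and status[-1].kind == "closeIssue"
    end = status[-1].time if closed else now
    # Intervals between neighboring actions.
    gaps = [(b.time - a.time).total_seconds() for a, b in zip(acts, acts[1:])]
    return {
        "CD": (end - acts[0].time).total_seconds(),
        "CS": len(acts),  # assumption: every action counts as one step
        "MaxI": max(gaps) if gaps else 0.0,
        "MinI": min(gaps) if gaps else 0.0,
        "MidI": median(gaps) if gaps else 0.0,
        "NP": len({a.actor for a in acts}),
        "NC": sum(a.kind == "comment" for a in acts),
    }


# Made-up example: an issue opened by alice and closed by bob four hours later.
t0 = datetime(2016, 1, 1, 12, 0)
issue = [
    Action("alice", "createIssue", t0),
    Action("bob", "comment", t0 + timedelta(hours=1)),
    Action("alice", "comment", t0 + timedelta(hours=3)),
    Action("bob", "closeIssue", t0 + timedelta(hours=4)),
]
features = collaboration_features(issue, now=datetime(2016, 1, 2))
print(features)
```

For an issue with a single action, the three interval features are undefined; the sketch returns 0.0 for them, which is again only a convention chosen for the example.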

2.3 Social Collaboration Patterns and GSP-Based Pattern Mining

We define two types of social collaboration patterns (CP) to describe the frequently occurring collaboration habits in a long-standing crowdsourcing team whose members frequently collaborate to solve challenges. In GitHub, a team is a virtual group composed of developers from all over the world. A CP is composed of a sequence of actions, each of which is taken by a specific participant (such a CP is called a Participant-oriented Pattern, PP) or by an abstract role (such a CP is called a Role-oriented Pattern, RP). A PP delineates the stable collaboration habits among concrete participants, while an RP focuses solely on the sequence of actions and does not care about who takes each action, i.e., multiple participants can be abstracted into a role if they perform the same actions in multiple collaboration processes. Thus, an RP may be regarded as the abstraction of a set of PPs which have the same sequence of actions but different groups of participants.

A PP is defined by \(PP\,{::}=\,\langle PS,AS,M\rangle\) where PS is a set of participants, each of which has a distinct identification; AS is a sequence of social actions, each of which is defined by its action type; and M is the mapping between PS and AS, i.e., \(\forall m\in M\), \(m=p\rightarrow index(a)\) (\(p\in PS, a\in AS\)) indicates that the action a is taken by the participant p, and index(a) is the position of a in the sequence AS. The definition of RP is similar to that of PP: \(RP\,{::}=\,\langle RS, AS, M\rangle\) where RS is a set of abstract roles; \(\forall m\in M\), \(m=r\rightarrow index(a)\) implies that the action a is taken by the role r.

In an issue-fix process, all the actions occur sequentially according to their timestamps, and so do the CPs hidden in the process. Thus, traditional sequential pattern mining approaches can be employed to identify CPs from historical crowdsourcing processes. In our study, the Generalized Sequential Pattern (GSP) algorithm is adopted. For PPs, the GSP algorithm can be applied directly to the historical crowdsourcing processes. For RPs, extensions to GSP are required to abstract concrete participants into roles according to the responsibilities that each participant takes in these processes. Due to limited space, the pseudo-code is not given here.
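To make the difference between the two pattern types concrete, the following sketch shows only the support-counting step that any GSP-style miner needs: deciding whether a candidate pattern occurs as a (not necessarily contiguous) subsequence of a process. It is not the extended GSP algorithm of this paper, and candidate generation and pruning are omitted. For RPs it assumes that a role is a placeholder that must be bound consistently to one participant within a process and that distinct roles bind to distinct participants; this binding rule, the function names, and the toy processes are assumptions made for the example.

```python
def occurs(pattern, process, roles=False):
    """Check whether `pattern` is a subsequence of `process`.

    Both are lists of (who, action) pairs. With roles=False, `who` must match
    the participant exactly (PP). With roles=True, `who` is a role name that is
    bound to a participant on first use and must stay consistent (RP).
    """
    def search(i, j, bind):
        if i == len(pattern):
            return True  # every pattern element has been matched
        who, act = pattern[i]
        for k in range(j, len(process)):
            p, a = process[k]
            if a != act:
                continue
            if not roles:
                if p == who and search(i + 1, k + 1, bind):
                    return True
            elif who in bind:
                if bind[who] == p and search(i + 1, k + 1, bind):
                    return True
            elif p not in bind.values():
                # Bind a new role to a participant not yet playing another role.
                if search(i + 1, k + 1, {**bind, who: p}):
                    return True
        return False

    return search(0, 0, {})


def support(pattern, processes, roles=False):
    """Number of processes that contain the pattern at least once."""
    return sum(occurs(pattern, proc, roles) for proc in processes)


# Made-up issue-fix processes of one team.
processes = [
    [("alice", "createIssue"), ("bob", "comment"),
     ("alice", "comment"), ("bob", "closeIssue")],
    [("carol", "createIssue"), ("dave", "comment"), ("dave", "addLabel"),
     ("carol", "comment"), ("dave", "closeIssue")],
    [("alice", "createIssue"), ("alice", "closeIssue")],
]
pp = [("alice", "createIssue"), ("bob", "comment"), ("bob", "closeIssue")]
rp = [("r1", "createIssue"), ("r2", "comment"), ("r2", "closeIssue")]

print(support(pp, processes))              # PP: tied to alice and bob
print(support(rp, processes, roles=True))  # RP: any two distinct participants
```

In this toy data the RP is supported by more processes than the PP with the same action sequence, because the second process follows the same steps with a different pair of participants.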

3 Empirical Study

3.1 Dataset

To study social collaboration features (CFs) in crowdsourcing, we require a set of sample crowdsourcing processes; and to study social collaboration patterns (CPs), we require a fair number of crowdsourcing processes coming from the same long-standing crowdsourcing teams. By this criterion, we select from GitHub 10 OSS projects that all have plenty of issues. They differ in a variety of respects, such as number of commits, team size, number of stars, and programming language.

We collect the issue-fix process data from GitHub Archive, an official data repository of GitHub. Data in GitHub Archive are organized by “events”, i.e., every action taken by a GitHub developer is recorded as an event. There are two types of events related to issues: IssueCommentEvent, which logs a “comment” on an issue, and IssuesEvent, which logs all other action types. The data are in JSON format. The collected data cover the years 2011 to 2016. In total, 53,475 issues and 248,000 actions are collected from the 10 projects. The social collaboration processes for these issues are then recovered.
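The recovery step can be pictured as grouping events by issue and ordering them by timestamp. The sketch below is illustrative only: it assumes one JSON object per line with the fields `type`, `actor.login`, `repo.name`, `payload.action`, `payload.issue.number`, and `created_at`, which may not hold for every year of the archive, and the mapping from GitHub's action names to the action names used in this paper (`ACTION_NAMES`) is a guess. The sample records are made up.

```python
import json
from collections import defaultdict

# Assumed mapping from event payload actions to the paper's action names.
ACTION_NAMES = {"opened": "createIssue", "closed": "closeIssue", "reopened": "reopen"}


def recover_processes(lines):
    """Group issue-related events by (repository, issue number), in time order."""
    processes = defaultdict(list)
    for line in lines:
        ev = json.loads(line)
        if ev.get("type") == "IssueCommentEvent":
            kind = "comment"
        elif ev.get("type") == "IssuesEvent":
            raw = ev["payload"]["action"]
            kind = ACTION_NAMES.get(raw, raw)
        else:
            continue  # not an issue-related event
        key = (ev["repo"]["name"], ev["payload"]["issue"]["number"])
        processes[key].append((ev["created_at"], ev["actor"]["login"], kind))
    for steps in processes.values():
        # UTC timestamps in the same ISO-8601 format sort correctly as strings.
        steps.sort()
    return dict(processes)


def sample(ev_type, login, action, ts):
    """Build one made-up event line in the assumed layout."""
    payload = {"action": action, "issue": {"number": 7}} if action else {}
    return json.dumps({"type": ev_type, "actor": {"login": login},
                       "repo": {"name": "demo/repo"}, "payload": payload,
                       "created_at": ts})


lines = [
    sample("IssueCommentEvent", "bob", "created", "2016-01-01T13:00:00Z"),
    sample("IssuesEvent", "alice", "opened", "2016-01-01T12:00:00Z"),
    sample("PushEvent", "carol", None, "2016-01-01T12:30:00Z"),
    sample("IssuesEvent", "bob", "closed", "2016-01-01T14:00:00Z"),
]
processes = recover_processes(lines)
for key, steps in processes.items():
    print(key, [(who, kind) for _, who, kind in steps])
```

Each recovered process is then a time-ordered list of (timestamp, participant, action) triples, which is the input that both the CF computation and the pattern mining operate on.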

3.2 Analyzing Collaboration Features

Statistics of CFs.

We calculate the values of the 7 CFs for the collaboration processes of all issues and then examine, for each project, the distribution of each CF over its issues. The following phenomena are observed:

(1) Apart from a few projects, all 7 CFs have quite similar medians and averages across the 10 projects, indicating that issue-fix processes share highly similar characteristics on the whole, even though the issues come from different projects. However, for each CF, the shapes of the distributions differ across projects (e.g., the distribution intervals and the number of outliers).

(2) The distributions of the 7 CFs are all right-skewed, indicating that most issues have relatively few participants, few collaboration steps, short time intervals between neighboring actions, and short durations. This further tells us that a majority of challenges in crowdsourcing do not require very complex social collaboration. However, the existence of outliers in the distributions of the 7 CFs implies that a few challenges are solved by complex collaborations (i.e., longer durations, more participants and collaboration steps, etc.). This is also supported by the fact that the average values of these CFs are generally close to the upper quartile.

Clustering Issues of One Project w.r.t CFs. In order to check whether the proposed CFs are sufficiently discriminative, we perform a clustering analysis on the issues belonging to the same project. The K-means clustering algorithm is adopted, and the Xie-Beni index is used to evaluate clustering quality so that the optimal number of clusters can be found. As different CFs have different value ranges, the data are normalized before clustering. For the 10 projects, at least 2 and at most 9 clusters are obtained; Fig. 1(a) and (b) show the clustering results of the projects JQ and GO, with 4 and 5 clusters, respectively. A spider diagram is employed to compare different clusters w.r.t CFs, in which the value on each dimension is the average of the corresponding CF over all the issues belonging to the same cluster.
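The clustering procedure described above can be sketched as follows. This is a simplified illustration under several assumptions, not the setup of the study: min-max scaling stands in for the unspecified normalization, a small hand-written Lloyd-style K-means with a fixed random seed replaces whatever implementation was used, the Xie-Beni index is computed in its hard-assignment form (the original index is defined for fuzzy memberships), and the toy data have only two features instead of seven.

```python
import math
import random


def min_max_normalize(rows):
    """Rescale every feature (column) to [0, 1]; constant columns become 0."""
    cols = list(zip(*rows))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]


def sqdist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(rows, centers):
    """Index of the nearest centre for every row."""
    return [min(range(len(centers)), key=lambda j: sqdist(r, centers[j]))
            for r in rows]


def kmeans(rows, k, seed=0, iters=100):
    """Plain Lloyd iteration starting from k randomly chosen rows."""
    centers = random.Random(seed).sample(rows, k)
    for _ in range(iters):
        labels = assign(rows, centers)
        new = []
        for j in range(k):
            members = [r for r, l in zip(rows, labels) if l == j]
            # Keep the old centre if a cluster becomes empty.
            new.append([sum(c) / len(members) for c in zip(*members)]
                       if members else centers[j])
        if new == centers:
            break
        centers = new
    return assign(rows, centers), centers


def xie_beni(rows, labels, centers):
    """Compactness divided by separation; lower values indicate better clustering."""
    compact = sum(sqdist(r, centers[l]) for r, l in zip(rows, labels))
    sep = min(sqdist(a, b) for i, a in enumerate(centers) for b in centers[i + 1:])
    return compact / (len(rows) * sep) if sep > 0 else math.inf


# Made-up (CS, NP) values for nine issues.
data = min_max_normalize([[2, 1], [3, 1], [2, 2], [10, 4], [11, 5],
                          [12, 4], [30, 9], [28, 10], [31, 11]])
scores = {}
for k in range(2, 6):
    labels, centers = kmeans(data, k)
    scores[k] = xie_beni(data, labels, centers)
best_k = min(scores, key=scores.get)  # number of clusters with the lowest index
print(best_k, scores)
```

Because K-means depends on its starting centres, a practical run would typically repeat the clustering with several seeds for each k before comparing index values.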

A chi-square test is adopted to test the independence of the four clusters w.r.t CF values, and the result shows that there are significant differences among them (\(p \text {-value} = 0\)). This indicates that the proposed CFs are significantly discriminative.

Fig. 1. Issue clusters w.r.t 7 CFs

Clustering Issues of Multiple Projects w.r.t CFs. Here we cluster the issues of all 10 projects together to identify whether there are commonalities and differences between the social collaboration of different projects. According to the Xie-Beni index, the optimal number of clusters is 4, and the result is shown in Fig. 1(c).

This result shows significant diversity among the four clusters. A detailed analysis of this diversity is not presented here, but the complexity of social collaboration clearly decreases from Cluster 0 to Cluster 3.

We compute the percentage of issues belonging to each cluster in every project. The following phenomena are observed:

(1) Cluster 1 and Cluster 2 are the dominant clusters: their proportions are about 30 %–40 % and 40 %–50 %, respectively, in all projects. This indicates a high degree of commonality among all projects. Judging from the CFs of the two clusters, this commonality can be described as “the dominant form of social collaboration for solving crowdsourcing challenges is of medium complexity”.

(2) The proportion of Cluster 0 in most projects is comparatively low, indicating that few challenges have to be solved in a very complicated way. However, some projects (such as FCC, GO and TJ) have a higher proportion of Cluster 0, which shows that there are diversities among projects.

(3) The proportion of Cluster 3 in most projects is more than 17 %. The collaboration processes of such challenges have very low complexity and involve small teams, which is another commonality among projects.

(4) In terms of the proportions of the four clusters, the 10 projects are classified into three types: (a) FCC and GO, which have a comparatively higher proportion of Cluster 0 and a lower proportion of Cluster 3; (b) ELE, DT, GOGS, JQ, TS and FS, all of which have a comparatively lower proportion of Cluster 0; (c) DK and TJ, which have comparatively more balanced proportions of the four clusters.

3.3 Mining and Analyzing Social Collaboration Patterns (CP)

Comparison Between PP and RP.

For CP mining from historical social collaboration processes, we use a different \(min\_sup\) for each project because the projects have different numbers of issues. A minimal support ratio (r) is specified in the range from \(1\,\%\) to \(5\,\%\); \(min\_sup\) for a project is then set by multiplying r by the number of its issues.
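As a small worked example of this rule, the thresholds for the project JQ (2,094 issues) can be computed as below. Rounding up to a whole number of issues is an assumption made for the sketch; the text only states that \(min\_sup\) is the product of r and the number of issues, and the function name is illustrative.

```python
import math


def min_sup(num_issues, ratio_percent):
    """Support threshold for one project, with the ratio given in percent.

    Assumption: the product is rounded up to a whole number of issues.
    """
    return math.ceil(num_issues * ratio_percent / 100)


# Thresholds for JQ at r = 1 %, 2 %, ..., 5 %.
thresholds = {r: min_sup(2094, r) for r in range(1, 6)}
print(thresholds)
```

A pattern is then kept as frequent for a project only if the number of issue-fix processes containing it reaches the project's threshold.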

First, we compare the numbers of PPs and RPs obtained. The project JQ, which has 2,094 issues, is used as an example. Comparisons under different r are shown in Fig. 2(a). It can be seen that: (1) GSP-based algorithms can identify a large number of PPs and RPs, indicating that social collaboration patterns do exist in a long-standing crowdsourcing team; (2) as r increases, the difference between the numbers of PPs and RPs keeps decreasing, and when \(r \ge 4\,\%\), the number of RPs exceeds that of PPs. This implies that there exist collaboration processes which contain frequently occurring sequential patterns that are not always carried out by the same group of participants, i.e., different participants tend to adopt the same collaboration styles.

Fig. 2. Comparison of PP/RP and comparison of individualized patterns

Individualized Patterns.

We are interested in whether there are individualized collaboration patterns, i.e., individualized collaboration habits among participants. A metric called the Individualized Pattern Index (IPI) is proposed to measure the degree of individualization of an RP in a specific project. If an RP appears in many projects, it tends to be a common pattern; if it appears in only one or a few projects, it tends to be an individualized pattern. IPI is measured by \(IPI(RP,i) = \frac{P(RP,i)}{\sum _{k=1}^{N} P(RP,k)/(N-1)}\) where N is the number of projects (here \(N=10\)), \(P(RP,i) = \frac{freq(RP,i)}{Num\_Issues_i}\), \(freq(RP,i)\) is the number of occurrences of RP in the historical social collaboration processes of the i-th project, and \(Num\_Issues_i\) is the total number of issues in the i-th project.

This formula ensures that, if a pattern’s occurrence frequency in one project is higher than in the other projects, the pattern has a greater IPI for this project and may be an individualized pattern of it. A threshold on IPI is set to judge whether a pattern is individualized w.r.t a project. In this study, we set the threshold to 3.0.
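The IPI computation can be illustrated with a short sketch that follows the formula exactly as printed above, i.e., the sum in the denominator runs over all N projects and is divided by N-1. The occurrence counts and issue counts are made up, the function name is illustrative, and treating "greater than 3.0" (rather than "at least 3.0") as individualized is an assumption.

```python
def ipi(freq, num_issues, i):
    """IPI(RP, i) as printed: P(RP, i) / (sum_k P(RP, k) / (N - 1)).

    freq[k] is the number of occurrences of the RP in project k and
    num_issues[k] is the total number of issues of project k.
    """
    p = [f / n for f, n in zip(freq, num_issues)]
    denom = sum(p) / (len(p) - 1)
    return p[i] / denom if denom > 0 else 0.0


# Made-up data for N = 10 projects with 100 issues each: the pattern occurs
# 40 times in project 0 and once in each of the other nine projects.
freq = [40] + [1] * 9
num_issues = [100] * 10

THRESHOLD = 3.0
scores = [ipi(freq, num_issues, i) for i in range(len(freq))]
individualized = [i for i, s in enumerate(scores) if s > THRESHOLD]
print([round(s, 2) for s in scores], individualized)
```

With the formula in this form, the IPI of a pattern that occurs in only one project equals N-1, so the index is bounded by 9 when N = 10.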

Afterwards, we calculate the ratio of individualized patterns in each project; the result is shown in Fig. 2(b). This result further validates that different crowdsourcing teams exhibit diversified social collaboration habits, although their degrees of individualization differ. Some projects, such as JQ, TJ and DK, demonstrate more individualized collaboration patterns, while projects such as FS, GO, and FCC tend to adopt more common collaboration patterns.

4 Related Work

Collaboration-based service crowdsourcing is essentially a set of incremental and iterative contributions to a set of artifacts. Participants observe state transitions of these artifacts and decide what actions to take [1]. This is what Liptchinsky et al. [3] call an “information-centric” approach to modeling social collaborations. Crowdsourcing occurs in web-based collaborative working environments, and the collaborative traces of participants are logged [2]. By mining these traces, social collaboration processes can be recovered by process mining [7], and the mechanism by which collective intelligence comes into being was explored by Zhang et al. [10]. In terms of collaboration patterns, Smirnov et al. [8] made a multi-dimensional classification. Onoue et al. [5] used the percentages of a variety of social actions as the representation of behavior patterns in OSS. Xuan et al. [9] worked on a simple pattern composed of only two types of actions and employed an HMM to describe the individualized behavior patterns of OSS developers.

5 Conclusions

(1) We put forward 7 distinct CFs for service crowdsourcing processes. The result of clustering bug-fix processes in terms of these CFs demonstrates that they can significantly distinguish different crowdsourcing processes. (2) Based on an extended GSP algorithm, two types of CPs (PP and RP) are mined, and the results verify that frequent collaboration patterns do exist in long-standing crowdsourcing teams. (3) Besides significant commonalities, different OSS projects also show distinctive CFs and CPs, especially in the ratio of individualized patterns and the ratio between the numbers of RPs and PPs.