Abstract
It is reported that the monetary prize of a task is one of the most important factors motivating crowd workers. While using expert-based methods to price crowdsourcing tasks is common practice, validating the associated prices across different tasks remains a constant challenge. To address this issue, three predictive models, namely multiple linear regression, logistic regression, and K-nearest neighbors (KNN), were compared to find the most accurate predicted price, using a dataset from the TopCoder website. The comparison showed that the logistic regression model provides the highest accuracy, 90%, in predicting the price associated with tasks, and that KNN ranked second with an accuracy of 64% for K = 7. In addition, applying PCA did not lead to better prediction accuracy, as the data components are not correlated.
1 Introduction
The available literature on the motivation patterns of crowdsourcing workers reports that the monetary prize associated with a task is one of the top factors motivating potential workers to take part in task competition [1]. The monetary prize usually represents the degree of task complexity as well as the required competition level [2, 3]. In practice, task requesters frequently employ expert-based methods to price tasks, which may involve a high degree of subjectivity, and validating the associated prices across different tasks remains a constant challenge.
To date, multiple pricing models have been introduced, with a focus on pricing strategy [4], the Context-Centric Pricing approach [5], the impact of price on workers’ behavior [6], and machine learning methods [7] to help task requesters predict a reasonable price range. However, none of these papers applied PCA to evaluate the accuracy of the presented models. In this work, we aim to investigate that gap.
To address this issue, three predictive models, namely multiple linear regression, logistic regression, and K-nearest neighbors (KNN), were compared to find the most accurate predicted price, using data extracted from the TopCoder website [8]. The comparison showed that the logistic regression model provides the highest accuracy in predicting the price associated with tasks. In addition, applying PCA did not lead to better prediction accuracy, as the data components are not correlated.
The rest of the paper is organized as follows: Sect. 2 introduces the research design; Sect. 3 presents the results and discussion of the research conducted; Sect. 4 gives a conclusion and an outlook on future work.
2 Research Design
2.1 Dataset and Metrics
The dataset used contains 514 component development tasks from Sep 2003 to Sep 2012, extracted from the TopCoder website. All tasks are completed, meaning that each received acceptable submissions with a score higher than 75. The total monetary prize is divided between the top-2 winners at a 2:1 ratio.
The initial analysis of the dataset shows that a typical task on TopCoder is priced at $750 [6] (i.e., $500 and $250 for the top-2 winners, respectively) and has an average size of 2290 lines of code. The median numbers of registrations and submissions are 16 and 4, respectively, and the median score of the winning submission is 94.16 out of 100. Although a common impression is that crowdsourcing is more feasible for easy and simple tasks, the data show that it is also feasible for complex component development at the scale of 21925 lines of code. The maximum of 72 registrants is surprising given the competitive nature of the tasks, in which only the top-2 winners get paid.
2.2 Dataset Preparation
Outliers were identified and removed for each variable using the 1.5 × IQR rule: values more than 1.5 times the interquartile range above the third quartile or below the first quartile were treated as outliers. Figure 1 illustrates the distribution of the data.
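The outlier rule described above can be sketched in plain Python for a single variable. This is a minimal illustration, not the authors' code: the quartile method (`"inclusive"`), the function names, and the sample values are assumptions introduced here.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    # Assumption: "inclusive" quartiles; the text does not name a method.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that lie inside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


# Illustrative numbers only, not taken from the TopCoder dataset.
sizes = [10, 12, 13, 14, 15, 16, 18, 100]
cleaned = remove_outliers(sizes)
```

On a multi-variable dataset, the same fences would be computed per column and a task dropped if any of its values falls outside them; whether the study dropped whole tasks or single values is not stated.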
A deeper analysis of the dataset also showed that 8 variables have the weakest relation with the monetary prize; therefore, we excluded them from our analysis. Table 1 summarizes the statistics of the remaining data in the clean dataset.
2.3 Empirical Studies and Design
To predict the monetary prize associated with each task, this research studied three predictive modeling methods: multiple linear regression, logistic regression, and KNN.
Multilinear Regression
The analysis on the training data suggested that monetary prize (MPi) follows a multiple linear regression, Eq. 1:
In which α is the constant term and βi is the coefficient for the variable (Xi).
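Equation 1 itself is not reproduced in this text. Under the definitions given (a constant term α and one coefficient β_i per variable X_i), a standard multiple linear regression would take the following form. The number of variables n and the error term ε are assumptions added here for completeness.

```latex
\mathrm{MP} = \alpha + \sum_{i=1}^{n} \beta_i X_i + \varepsilon
```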
Logistic Regression
As reported, the average monetary prize for a task on TopCoder is $750. Therefore, to apply logistic regression, we assigned 0 to tasks with a monetary prize of less than $750 and 1 to tasks with a monetary prize of more than $750.
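The labeling step above can be sketched as follows. The function name and sample prizes are hypothetical, and the handling of a prize of exactly $750 is an assumption, since the text only covers prizes below and above the threshold.

```python
THRESHOLD = 750  # dollars; the cut-off stated in the text


def prize_label(prize, threshold=THRESHOLD):
    """Return 1 for a prize above the threshold and 0 otherwise.

    Assumption: a prize exactly equal to the threshold is labeled 0;
    the text does not say how such ties are handled.
    """
    return 1 if prize > threshold else 0


# Illustrative prizes only, not taken from the dataset.
prizes = [75, 500, 750, 900, 2000]
labels = [prize_label(p) for p in prizes]
```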
KNN
To apply KNN, the first 20 values of K were tried to find the best number of neighbors on the clean dataset. PCA was then applied to the dataset and KNN was rerun to study the effect of PCA on the accuracy of the KNN results.
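The search over K can be sketched in plain Python as below. This is an illustration under assumptions: Euclidean distance, majority voting, accuracy measured on a held-out set, and the toy data are all choices made here, not details confirmed by the text. The PCA step is not shown.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Predict the label of x by majority vote among its k nearest points."""
    neighbors = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def knn_accuracy(train_X, train_y, test_X, test_y, k):
    """Fraction of test points whose predicted label matches the true one."""
    hits = sum(
        knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y)
    )
    return hits / len(test_y)


def best_k(train_X, train_y, test_X, test_y, k_values):
    """Return (k, accuracy) with the highest accuracy; smallest k wins ties."""
    scores = [
        (knn_accuracy(train_X, train_y, test_X, test_y, k), k) for k in k_values
    ]
    acc, k = max(scores, key=lambda s: (s[0], -s[1]))
    return k, acc


# Toy data: two well-separated clusters, labels 0 and 1.
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [(2, 2), (9, 9)]
test_y = [0, 1]
k, acc = best_k(train_X, train_y, test_X, test_y, range(1, 6))
```

To compare against PCA, the same sweep would be repeated on the projected features and the two accuracy curves compared.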
3 Result and Discussion
3.1 Multilinear Regression
The monetary prize is a continuous value, ranging from $75 to $2000. This suggests that a multiple linear regression model would be a good match for predicting the monetary prize, so we first used a multiple regression model to analyze the data. The initial model gives an R² value of 0.5448 with 376 degrees of freedom, which indicates a reasonable fit to the data. However, a large number of task variables had no significant impact on the prediction. Therefore, the insignificant variables were removed and the model was re-run. The second model gives an R² value of 0.5108 with 389 degrees of freedom, which is a slightly worse fit for monetary prize prediction.
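For reference, the R² statistic quoted above is the usual coefficient of determination. A minimal sketch of how it is computed from actual and predicted values follows; the sample numbers are invented for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


# Illustrative prizes and predictions only, not taken from the study.
actual = [500.0, 750.0, 1000.0, 1250.0]
predicted = [550.0, 700.0, 1050.0, 1200.0]
score = r_squared(actual, predicted)
```

For nested least-squares models fitted to the same data, dropping variables cannot raise R², so a small decrease like the one reported is expected when insignificant variables are removed.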
The multiple linear regression model was built using 70% of the data for training and 30% for testing. The accuracy of the model is shown in Fig. 2. The prediction model tended to underestimate the monetary prize; however, the medians of the actual and predicted monetary prizes remained similar.
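A 70/30 split like the one described above can be sketched as follows. Random shuffling and the fixed seed are assumptions made here; the text does not say how the tasks were assigned to the two sets.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # Assumption: a seeded random shuffle, so the split is reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


tasks = list(range(10))  # stand-ins for task records
train, test = train_test_split(tasks)
```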
3.2 Logistic Regression
To apply logistic regression to our dataset, the first step is to convert the monetary prize to discrete data. Therefore, we assigned 0 to prices less than $750 and 1 to prices more than $750. As with multiple linear regression, the initial model showed that a large number of task variables were not influential in the equation. Therefore, to make a more efficient model, we used only the significant variables to create the prediction model. Figure 3 presents the task variables versus the predicted probability of the monetary prize class from the logistic regression model; the natural S-curve shape of the model is clearly visible.
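The S-curve mentioned above comes from the logistic function, which maps a linear score to a probability. The sketch below shows this mapping; the intercept, coefficients, and feature values are hypothetical and are not the fitted values from the study.

```python
import math


def sigmoid(z):
    """Logistic function: maps a real score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, intercept, coefficients):
    """Probability of class 1 (prize above $750) for one task."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)


# Hypothetical coefficients for two made-up task variables.
p_low = predict_probability([0.0, 0.0], intercept=-2.0, coefficients=[1.5, 0.5])
p_high = predict_probability([3.0, 2.0], intercept=-2.0, coefficients=[1.5, 0.5])
```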
Moreover, the confusion matrix of the model with a prediction threshold of 50% shows a promising result (Table 2). The presented model predicts the assigned monetary prize class with an accuracy of 90%.
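The accuracy figure follows from the confusion matrix in the usual way. The sketch below builds a binary confusion matrix at a 0.5 threshold and derives accuracy from it; the labels and probabilities are made up and do not reproduce Table 2.

```python
def confusion_matrix(actual, predicted_prob, threshold=0.5):
    """Count (TN, FP, FN, TP) for binary labels at a probability threshold."""
    tn = fp = fn = tp = 0
    for y, p in zip(actual, predicted_prob):
        y_hat = 1 if p >= threshold else 0
        if y == 1 and y_hat == 1:
            tp += 1
        elif y == 1 and y_hat == 0:
            fn += 1
        elif y == 0 and y_hat == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def accuracy(tn, fp, fn, tp):
    """Share of all predictions that are correct."""
    return (tn + tp) / (tn + fp + fn + tp)


# Illustrative labels and probabilities only.
actual = [0, 0, 1, 1, 1, 0, 1, 0, 1, 0]
probs = [0.1, 0.4, 0.8, 0.9, 0.3, 0.2, 0.7, 0.6, 0.95, 0.05]
matrix = confusion_matrix(actual, probs)
acc = accuracy(*matrix)
```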
3.3 KNN
To find the best number of nearest neighbors, the algorithm was run for K = 1 to K = 40. K = 7 and K = 8 provide the maximum accuracy of 64%. Figure 4 shows the error trend for K between 1 and 40, and Table 3 gives the classification report for K = 7 as an example.
In the next step, PCA was applied and the accuracy of KNN under PCA was analyzed. The result showed that PCA not only failed to improve accuracy but decreased it by 4%. Figure 5 shows the error rate of the KNN model after applying PCA for K between 1 and 40.
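One way to check whether PCA is likely to help is to inspect the pairwise Pearson correlations between variables: when the off-diagonal values are close to zero, there is little shared variance for PCA to compress. The sketch below is an illustration only; the column names and values are hypothetical, and the text does not say how correlation was assessed.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map each (name_a, name_b) pair to the correlation of its columns."""
    names = list(columns)
    return {
        (a, b): pearson(columns[a], columns[b]) for a in names for b in names
    }


# Hypothetical columns: the first two are perfectly correlated,
# the third is uncorrelated with the first.
columns = {
    "size": [1, 2, 3, 4],
    "registrants": [2, 4, 6, 8],
    "submissions": [1, -1, -1, 1],
}
corr = correlation_matrix(columns)
```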
4 Conclusions
The monetary prize usually represents the degree of task complexity as well as required competition levels. In practice, task requesters frequently employ expert-based methods to price tasks, which may involve a high degree of subjectivity, while the challenge of validating the associated prices across different tasks is a constant issue.
To address this issue, three predictive models, namely multiple linear regression, logistic regression, and K-nearest neighbors (KNN), were compared to find the most accurate predicted price, using a dataset from the TopCoder website. The comparison showed that the logistic regression model provides the highest accuracy, 90%, in predicting the price associated with tasks, and that KNN ranked second with an accuracy of 64% for K = 7. In addition, applying PCA did not lead to better prediction accuracy, as the data components are not correlated.
In the future, we would like to focus on similar crowd worker behavior and performance based on task similarity level, and to analyze task-worker performance in order to report more decision elements according to the monetary prize, task size, task utilization, and crowd workers’ performance.
References
1. Stol, K.-J., Fitzgerald, B.: Two’s company, three’s a crowd: a case study of crowdsourcing software development. In: The 36th International Conference on Software Engineering (2014)
2. Faradani, S., Hartmann, B., Ipeirotis, P.G.: What’s the right price? Pricing tasks for finishing on time. In: Proceedings of the Human Computation (2011)
3. Archak, N.: Money, glory and cheap talk: analyzing strategic behavior of contestants in simultaneous crowdsourcing contests on topcoder.com. In: Proceedings of the 19th International Conference on World Wide Web (WWW 2010), New York, NY, USA, pp. 21–30 (2010)
4. Wang, L., Yang, Y., Wang, Y.: Do higher incentives lead to better performance? An exploratory study on software crowdsourcing. In: 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) (2019)
5. Alelyani, T., Mao, K., Yang, Y.: Context-centric pricing: early pricing models for software crowdsourcing tasks. In: PROMISE: Proceedings of the 13th International Conference on Predictive Models and Data Analytics in Software Engineering, pp. 63–72, November 2017
6. Yang, Y., Saremi, R.: Award vs. worker behaviors in competitive crowdsourcing tasks. In: ESEM 2015, pp. 1–10 (2015)
7. Mao, K., Yang, Y., Li, M., Harman, M.: Pricing crowdsourcing-based software development tasks, Piscataway, NJ, USA, pp. 1205–1208 (2013)
8. Topcoder website. http://www.topcoder.com
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Lotfalian Saremi, M., Saremi, R., Martinez-Mejorado, D. (2020). How Much Should I Pay? An Empirical Analysis on Monetary Prize in TopCoder. In: Stephanidis, C., Antona, M. (eds) HCI International 2020 - Posters. HCII 2020. Communications in Computer and Information Science, vol 1226. Springer, Cham. https://doi.org/10.1007/978-3-030-50732-9_27
DOI: https://doi.org/10.1007/978-3-030-50732-9_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-50731-2
Online ISBN: 978-3-030-50732-9
eBook Packages: Computer Science (R0)