Identifying crowdfunding storytellers who deliver successful projects: a machine learning approach

Pourroostaei Ardakani, Saeid; Hu, Jianwei; Zhang, Jing; Jin, Kaifeng; Cai, Tianhong; Graham Bellotti, Anthony; Hua, Xiuping

doi:10.1007/s11227-024-06785-4

Open access
Published: 09 December 2024

Volume 81, article number 263, (2025)

The Journal of Supercomputing

Saeid Pourroostaei Ardakani¹,
Jianwei Hu²,
Jing Zhang³,
Kaifeng Jin⁴,
Tianhong Cai⁵,
Anthony Graham Bellotti⁵ &
…
Xiuping Hua⁶

Abstract

Crowdfunding plays a key role in financial technology to provide individuals and enterprises with funding opportunities to establish start-ups and/or new business ventures. It is mainly used to link projects’ creators and backers, collect money and plan fundraising projects via social networks. This paper proposes a machine learning-enabled approach to analyse Kickstarter numerical and textual data and predict the successful funding and delivery of crowdfunding projects. It offers crowdfunding stakeholders benefits including creator credibility assessment, project risk reduction, and backer confidence enhancement. This research proposes a data preprocessing approach to prepare the dataset and extract the relevant features for the predictions. Besides, it trains and compares five numerical machine learning classification models and three text-mining methods to find the best-fitted numerical and textual analysis approaches. According to the results, the proposed SVM model outperforms the numerical benchmarks in terms of Accuracy, Precision, Recall, F1 score, and model Training latency. Moreover, BERT gives the best results if the dataset is complex, while Word2vec works better with simple features in textual analysis.

1 Introduction

Crowdfunding usually aims to raise funds from individuals and organisations to fill the financing gap faced by small entities and individuals with limited credit records [5, 55]. The deliverables and rewards of crowdfunding vary and include physical goods, virtual products, equity, acknowledgement, and selfless donation. Crowdsourcing applications can be used to raise funds for high-technology, artistic, or social projects via user social interactions (i.e. social media [4, 17, 31]).

The worldwide crowdfunding market reached $12.27 billion by the end of 2020 and is estimated to be worth $25.8 billion in 2027, with a compound annual growth rate of 11% [46]. Accordingly, several online platforms such as Kickstarter [10], Indiegogo [18], and Patreon [35] have been established to support crowdfunding projects. For example, $5943.77 million was pledged for project fundraising on Kickstarter from July 2012 to July 2021 [47]. However, crowdfunding business models still suffer from drawbacks caused by demand uncertainties and moral hazards [48]. Hence, it is difficult to identify or forecast whether project creators will successfully deliver the goals and outcomes. Furthermore, project delivery needs to cope with several challenges including the lack of a funding threshold [12], price variation [56], and quality assurance [60].

Reward-based crowdfunding projects provide the project’s backers with rewards after the accomplishment of funding goals [52]. As Fig. 1 shows, they contain three stages: (1) Creation, to set up the project and prepare the deliverables according to a schedule; (2) Fundraising, which provides analytical patterns, mainly the project’s story (i.e. background, motivation, components, and outcome), risk (i.e. project development challenges), and rewards, to build predictive models [26]; and (3) Delivery, which aims to finalise the project and monitor the project’s deliverables by analysing creator and backer data. According to [16, 53, 58], the development of an accurate prediction model to forecast project delivery and fundraising remains an open question. Hence, the following are required to analyse reward-based projects:

To identify the projects at stage 1 that will become successfully funded.
To highlight the features that best drive funding success.
To identify funded projects at stage 2 that will be successfully delivered.

1.1 Research gaps and motivation

Accurate crowdfunding success and delivery prediction relies on thorough data preparation and advanced machine learning methodologies. However, existing methods often fall short in addressing the complexities of crowdfunding numerical and textual data, leaving gaps in preprocessing, feature extraction, and model training and optimisation. This highlights the need for robust data preprocessing strategies that address issues such as complexity and unstructured formats and ensure the extraction of meaningful, high-quality features suitable for machine learning models. Moreover, there is a need to develop integrated machine learning models that seamlessly process both numerical and textual datasets and accurately predict Success and Delivery targets. Optimisation and rigorous evaluation of the machine learning models are also essential to identify the best-performing solution and enhance the reliability of crowdfunding predictions. Our research aims to bridge these gaps by introducing a full-scope machine learning solution tailored to prepare, process and analyse both numerical and textual crowdfunding data, emphasising the potential of machine learning solutions for accurate and reliable crowdfunding predictions. The key research questions are outlined as follows:

RQ1: Which preprocessing steps are necessary for preparing numerical and textual data to train predictive machine learning models in crowdfunding applications?
RQ2: What are the optimal machine learning methods to analyse numerical crowdfunding datasets and predict project’s success and delivery?
RQ3: Which textual analysis approaches are best-fitting for predicting the success of funding and meeting delivery targets?

1.2 Contributions

This paper proposes a numerical and textual machine learning approach to predict fundraising success and project delivery in crowdfunding applications. It uses a well-known dataset that is crawled from the online Kickstarter platform [10] using Python. It includes both numerical statistics and textual descriptions. The data understanding and preparation steps are important first stages in data mining projects [39]. Therefore, although previous studies have applied data analytics to address these research questions, it is expected that further work in preprocessing the data to handle data distribution, missing values, feature selection, and textual data will further help address the research questions. The dataset is cleaned and prepared via a hybrid numerical and textual preprocessing approach. Numerical predictive models include Support Vector Machine (SVM) [37], K-Nearest Neighbour (KNN) [38], Multilayer Perceptron Classifier (MLP) [57], Random Forest [24], and Gradient Boosting Classifier [34], while the textual analysis techniques used in this study are Term Frequency Inverse Document Frequency (TF-IDF) [41], Word2Vec [30], and Bidirectional Encoder Representations from Transformers (BERT) [11]. They are trained and evaluated to find the best-fitted numerical prediction and storytelling analysis techniques in crowdfunding applications. The following outlines the key contributions of this research:

Preprocess the dataset and extract/select data features (e.g., creator updates, pledged amount, and comments) that act as the indicators of successfully funded or delivered crowdfunding projects.
Train a numerical data model to predict crowdfunding projects’ fundraising and delivery status with improved performance.
Build a text-mining model to analyse crowdfunding storytelling and predict fundraising success and project delivery with improved performance.
Optimise, evaluate and compare the machine learning techniques to find the best-fitted predictive approaches for numerical and textual analysis in crowdfunding applications.

The rest of this paper is organised as follows: Sect. 2 reviews the literature to introduce crowdfunding applications and highlight state-of-the-art data analysis techniques in this field of research. Section 3 describes the research methodology and presents the experimental plan, with Sects. 3.1–3.3 describing the preprocessing and feature selection steps introduced for this study. Section 4 demonstrates and discusses the evaluation results in terms of model Accuracy, Precision, Recall, F1 score, and Training latency. Section 5 discusses the key findings of this study, outlines the theoretical and managerial implications, and introduces future directions and limitations. Section 6 concludes with a summary of research findings.

2 Literature review

This section introduces crowdfunding and reviews the literature to highlight the similarities, differences, and superiorities of state-of-the-art data analysis solutions in this field of research.

2.1 Crowdfunding

The emergence of crowdfunding enables a new business model that requires the support of qualitative analytics and uses quantitative analysis to calculate the outcomes of crowdfunding projects. Two groups of stakeholders manage crowdfunding projects: creators and backers [22]. The creators propose ideas, create crowdfunding projects, and set up fundraising goals, while the backers pledge the budget and support the creators to complete the project.

The result of crowdfunding quantitative analysis is studied based on two factors including Success and Delivery. Success indicates whether the project is funded with a sufficient amount as requested by creators. It depends on the project stages that are defined by the project strategies. Delivery shows the status of a crowdfunding project if the creator has achieved the initial funding goals and delivered the planned rewards to backers. It is monitored and analysed to evaluate and/or predict the completion of a crowdfunding project.

The correlations between crowdfunding factors and Success and Delivery have been investigated and discussed in the crowdfunding literature. Calic and Shevchenko [7] propose an inverted-U-shaped correlation model to study Success, innovativeness, competitive aggressiveness, and risk management features. They report a positive non-monotonic relationship between the crowdfunding project proactivity and Success. Gafni et al. [15] use a text-mining approach to analyse the crowdfunding project descriptions. They demonstrate that the Success rate is increased if entrepreneurs’ names are frequently used within the project descriptions. In addition, Ren et al. [42] report a positive correlation between the frequency of arousal words in the crowdfunding project descriptions and Success. Zheng et al. [60] propose a correlation model to study the relationship between crowdfunding Success factors (i.e. project scheduling and sponsor enquiry satisfaction) and Delivery. Mollick [31] highlights crowdfunding factors leading to on-time Delivery, while Appio et al. [1] utilise a text-mining approach to highlight the reasons for Delivery delay. Tuo et al. [52] use Qualitative Comparative Analysis (QCA) to evaluate crowdfunding Delivery status (e.g., on-time or late). Tran and Lee [50] use Linguistic Inquiry and Word Count (LIWC) dictionary techniques to demonstrate that project simplicity and real-time creator interaction increase the chance of project Delivery, while Wang et al. [54] aim to recognise Delivery status using a Latent Dirichlet Allocation (LDA)-enabled configuration model and analyse antecedent factors using QCA.

2.2 Crowdfunding and social applications

Facilitating a successful crowdfunding project requires proper management of the project process and a deep understanding of critical factors, and decision support research has been proposed to aid project development and completion. Song et al. [45] filter 26 candidate factors in their real-win-worth framework and predict success using stepwise regression. Pati and Garud [36] consider the impact of social interaction on crowdfunding success and the moderating role of product characteristics. They show that the ideation stage benefits more from social activities than the commercialisation stage, and that the positive relationship between social interaction and project success is stronger for products with incremental innovativeness than for radical ones. Cappa et al. [8] reveal the impact of two types of products and four types of rewards on the total raised funds and find that individual ownership products obtain more support. Buttice and Ughetto [6] summarise current trends in crowdfunding research. Although the current literature addresses aspects of the crowdfunding project lifecycle (i.e. initialisation, classification and organisation), it still lacks a comprehensive ML-enabled data analysis approach to predict project status and assist the creators in managing their projects.

2.3 Data analysis in crowdfunding

Data analysis techniques are usually used in crowdfunding applications to analyse the impact of crowdfunding factors/strategies on Success and/or Delivery [23, 50]. They are classified into two categories: numerical [58] and textual [59]. The former analyses numerical statistics (e.g., the number of backers) that may change during the fundraising, while the latter processes textual data (e.g., project description) which stays unchanged during crowdfunding stages.

The numerical crowdfunding outcome status prediction problem can be posed as a binary classification problem, with the target variables to predict being Success (True/False) and Delivery (True/False). Several well-known machine learning algorithms have been proposed to solve this problem. Kamath and Kamat [20] forecast Kickstarter Success using Naive Bayes, Neural Network, Random Forest, and Decision Tree classification techniques. The results show that the Neural Network model has the best accuracy (84%) compared with other benchmarks. Yu et al. [58] analyse the correlations between crowdfunding projects and Success to predict project fundraising and report that the multi-layer perceptron (MLP) has the best prediction accuracy (93%). Jhaveri et al. [19] extract eight features from the dataset and use Weighted Random Forest (WRF), AdaBoost, XGBoost, and CatBoost techniques to predict the fundraising Success rate. According to the results, the WRF with AdaBoost model has the maximum accuracy of 84.79%. Sawhney et al. [43] utilise a linear kernel SVM to build a binary classifier and predict Success with an accuracy of 92%. Tian [49] collects the information of 248,733 projects (from May 2nd 2009 to September 1st 2018) and utilises data analysis models such as Ridge, Elastic Net, Decision Tree, Random Forest, and Neural Network to predict the fundraising status. According to the results, Random Forest has the best accuracy (87.85%). Tran et al. [51] show that XGBoost gives the best accuracy (i.e. 71.4%) to predict the project’s reward Delivery.

Textual crowdfunding data analysis focuses on text-mining techniques to build a predictive or descriptive model. Nam et al. [33] use TF-IDF technique to create a document-term-matrix (DTM) for a textual dataset with 1,980 crowdfunding projects and forecast the Success.

Table 1 Advantages and limitations in the literature


This literature review outlines the machine learning and data analysis methods used for predicting success and delivery outcomes in crowdfunding applications, and highlights the critical factors that influence crowdfunding decision-making. Table 1 summarises their strengths and limitations. Yet, the findings reveal the absence of an integrated machine learning method capable of handling both numerical and textual features, while optimising and validating predictive models for enhanced and reliable crowdfunding project management.

3 Methodology

This research proposes numerical and textual predictive solutions for crowdfunding applications. The research methodology diagram is shown in Fig. 2. As it shows, the crowdfunding dataset is split into numerical and textual parts, each of which uses a data preprocessing approach to clean and extract high-quality features. Machine learning and text-mining techniques are used to process numerical and textual data features and predict Success and Delivery. The originality of this research can be summarised in two points: (1) to propose a data preprocessing method that extracts the relevant data features in both numerical and textual crowdfunding datasets and prepares them for predictive analysis, and (2) to design and implement predictive machine learning applications through a rigorous and comparative data-driven approach. For this, several classification and text-mining machine learning models are trained, evaluated and optimised to find the best-fitted numerical and textual prediction approaches in crowdfunding applications.

3.1 Dataset selection

This research uses an original dataset [10] which comprises 24,100 records and 103 features for "Technology" in the "Design & Tech" Kickstarter category from April 2009 until February 2020. Kickstarter has been a major online crowdfunding data source for various creative projects since its establishment in April 2009. Its records are continuously updated by worldwide crowdfunding projects during the fundraising process. However, this can lead to unexpected predictions/results if an update happens during the machine learning model training. To prevent any discrepancies, a new binary feature (updateBeforeFund) is derived from updatetime and end-time features to indicate whether the update occurs before the fundraising begins.

This dataset is divided into numerical and textual parts used to train the machine learning models for Success and Delivery prediction. The Success outcome variable exists in the dataset to indicate the project’s funding status. However, Delivery is derived from the project completion features and added to the dataset to show if the creator has achieved the initial project goals and delivered the planned rewards to the backers. It updates the total number of features to 104. The quality of the dataset is evaluated via Facets [13] in terms of noise and missing values. According to the results, the data distribution is normal/uniform (with no data distribution outliers) for most features.

3.2 Numerical data analysis

Machine Learning classification is used to analyse the numerical dataset and predict two targets including Success and Delivery. The following outlines the process:

3.2.1 Numerical data preprocessing

The numerical data preprocessing employs three techniques including one-hot encoding, feature standardisation and feature selection to prepare the numerical dataset for machine learning analysis. One-hot encoding [16] is used to convert the categorical string feature subcategory, which takes 17 distinct values, into a binary vector to suit the machine learning models. Equation 1 demonstrates how the subcategory feature (x) is transformed by mapping each sample’s value ($x_i \in \{c_1, c_2, \ldots , c_{17}\}$) into a 17-dimensional binary vector ($y_i$). These binary vectors are then aggregated across all m samples to form a matrix ($\mathbb{Y}$) which is subsequently appended to the dataset.

$$\begin{aligned} y_i &= \begin{bmatrix} \mathbb{I}(x_i = c_1), \mathbb{I}(x_i = c_2), \ldots , \mathbb{I}(x_i = c_{17}) \end{bmatrix}, \quad \mathbb{I}(x_i = c_j) = \begin{cases} 1, & \text{if } x_i = c_j, \\ 0, & \text{otherwise,} \end{cases} \\ \mathbb{Y} &= \begin{bmatrix} \mathbb{I}(x_1 = c_1) & \mathbb{I}(x_1 = c_2) & \ldots & \mathbb{I}(x_1 = c_{17}) \\ \mathbb{I}(x_2 = c_1) & \mathbb{I}(x_2 = c_2) & \ldots & \mathbb{I}(x_2 = c_{17}) \\ \vdots & \vdots & \ddots & \vdots \\ \mathbb{I}(x_m = c_1) & \mathbb{I}(x_m = c_2) & \ldots & \mathbb{I}(x_m = c_{17}) \end{bmatrix}. \end{aligned}$$

(1)

A Log Transformation technique [40] is used to standardise the features with abnormal distributions. As Table 2 shows, the original skewness values of five features including goal, reward, backer, newBacker, and returningBacker are abnormal. Therefore, they are transformed using the natural logarithm to bring them closer to a normal distribution.
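The one-hot encoding of Eq. 1 and the log transformation of skewed features can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the three category names and the `goal` values are hypothetical, and `log1p` (the natural logarithm of 1 + x) is an assumption made here so that zero-valued counts stay finite.

```python
import math


def one_hot(values, categories):
    """Map each value x_i to a binary vector y_i and stack the rows into Y."""
    index = {c: j for j, c in enumerate(categories)}
    matrix = []
    for x in values:
        row = [0] * len(categories)
        row[index[x]] = 1  # indicator I(x_i = c_j)
        matrix.append(row)
    return matrix


def log_transform(column):
    """Natural-log transform; log1p is an assumption to handle zeros."""
    return [math.log1p(v) for v in column]


def skewness(column):
    """Population skewness: third central moment over variance^1.5."""
    n = len(column)
    mean = sum(column) / n
    m2 = sum((v - mean) ** 2 for v in column) / n
    m3 = sum((v - mean) ** 3 for v in column) / n
    return m3 / m2 ** 1.5


# Hypothetical data: the paper's subcategory feature has 17 categories.
subcategories = ["Gadgets", "Apps", "Robots"]
Y = one_hot(["Apps", "Robots", "Apps"], subcategories)

goal = [100, 500, 1000, 5000, 250000]  # hypothetical right-skewed feature
log_goal = log_transform(goal)
print(Y)
print(skewness(goal), skewness(log_goal))
```

On this toy column the transformed values are noticeably less skewed than the raw ones, which is the effect the log transformation is meant to achieve.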

Table 2 Skewness of the abnormal features


The Pearson correlation coefficient technique [3] is used for feature selection by measuring the correlation between data features and targets (i.e. Success and Delivery). It is used to identify the meaningful features and remove the irrelevant ones from the dataset before training the machine learning models. According to [25], an absolute value of 0.02 is the minimum magnitude of correlation score for feature selection. Figs. 3 and 4 show Pearson correlations between the features and the Success and Delivery targets. Positive correlations are marked in red, with darker shades indicating stronger positive correlations, while negative correlations appear in blue, with deeper tones representing stronger negative correlations. Correlation values diminish as the colour approaches white. Table 3 lists the top features selected from both the Success and Delivery datasets, ordered by their Pearson correlation scores. Applying the proposed preprocessing method results in a numerical dataset comprising 18,514 records and 119 features with Success and Delivery outcomes representing 29.2% and 25.1% of the data respectively.
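The correlation-based selection described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the feature columns and target values are hypothetical, and the helper names `pearson` and `select_features` are introduced here for illustration only. The 0.02 threshold is the one quoted in the text.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sd_x * sd_y)


def select_features(features, target, threshold=0.02):
    """Keep features with |r| >= threshold, ordered by decreasing |r|."""
    scores = {name: pearson(col, target) for name, col in features.items()}
    kept = [(name, r) for name, r in scores.items() if abs(r) >= threshold]
    return sorted(kept, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy columns; the real dataset has far more features and rows.
features = {
    "backer": [1, 2, 3, 4, 5, 6],
    "constant_flag": [1, 1, 1, 1, 1, 1],
    "creator_updates": [0, 3, 1, 4, 2, 5],
}
success = [0, 0, 0, 1, 1, 1]

selected = select_features(features, success)
for name, r in selected:
    print(f"{name}: r = {r:.3f}")
```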

Table 3 Selected features


3.2.2 Machine learning model training

The numerical dataset is partitioned into training (80%) and test (20%) parts, and a 5-fold cross-validation approach is used to reduce the impact of data dependency, ensure robust validation of the results, and mitigate the risk of overfitting. Five machine learning classification models including SVM [37], KNN [38], MLP [57], Random Forest [24] and Gradient Boosting Classifier (Boosted Tree) [34] are trained and tested to identify the best-fitting predictive approach in crowdfunding applications. The following outlines the rationale of the benchmark models:

KNN is interpretable and easy to understand and implement. It is non-parametric and makes no assumptions about the underlying data distribution. KNN is well-suited for non-linear data, where distribution is unknown or complex (e.g., crowdfunding data).
SVM has the advantage of being versatile because it adapts to a variety of data patterns with customisable kernel functions and has the capacity to resolve non-linear classification problems in both high- and low-dimensional spaces [32]. It also offers robustness to overfitting, as it maximises the margin to improve generalisation.
Random Forest is a popular model for non-linear classification, especially when a mix of categorical and numerical data (e.g., Kickstarter) needs to be handled without extensive preprocessing. It can perfectly fit the input–output relationships with high complexity if an unlimited number of tree estimators are trained [14]. Moreover, Random Forest is scalable and robust to overfitting and outliers in complex datasets.
MLP, a universal function approximator, is used as it has the capacity to resolve complicated classification problems and capture non-linear interactions from complex data (e.g., Kickstarter) with various data types.
Boosted Tree is scalable and highly flexible with complex datasets. It is capable of fitting a highly accurate prediction function by utilising sequential error correction, which allows the ensemble to focus on the most difficult-to-classify instances, resulting in precise and refined classifications over time [44].

The results of the machine learning models are evaluated and compared in terms of accuracy, precision, recall, F1 score, and training latency to find the best-fitted approach.
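The evaluation protocol above (80/20 split, 5-fold cross-validation, five classifiers, training latency) can be sketched with scikit-learn as follows. This is a minimal sketch, not the authors' code: the data are synthetic (`make_classification`), the hyperparameters are mostly library defaults, and running cross-validation on the training part only is an assumption about how the two steps are combined.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the numerical dataset (roughly 30% positives).
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "SVM": SVC(kernel="linear"),
    "KNN": KNeighborsClassifier(),
    "MLP": MLPClassifier(max_iter=500, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Boosted Tree": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy on the training part.
    cv_accuracy = cross_val_score(model, X_train, y_train,
                                  cv=5, scoring="accuracy").mean()
    # Training latency: wall-clock time of a single fit on the training part.
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start
    results[name] = {
        "cv_accuracy": cv_accuracy,
        "test_accuracy": model.score(X_test, y_test),
        "train_seconds": train_seconds,
    }

for name, row in results.items():
    print(name, row)
```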

3.3 Textual data analysis

Three text-mining techniques including BERT [11], Word2vec [21], and TF-IDF [9] are used to analyse textual data (i.e. Story and Story&Risk) and predict Success and Delivery.

3.3.1 Dataset preprocessing

The textual dataset is initially formed with three features including Story, Risk and Story&Risk. They include general project information and the description of potential risks for each Kickstarter project. Story introduces a Kickstarter project in terms of the project creator, idea, and plan, while Risk provides an analysis of the potential risks (e.g., the technical problems) and challenges. Risk contains 3.28% missing values, but Story and Story&Risk features have no missing values.

The textual dataset is preprocessed and cleaned using a regular expression and word replacement approach [2]. It collects the textual data features from the original Kickstarter dataset and removes HTML tags, video statements, and stop words from the textual records to form a high-quality textual dataset with 18,514 records and three features: Story, Risk, and Story&Risk. “Appendix A” outlines a few samples for Story and Risk features.
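A regular-expression cleaning step of this kind can be sketched as follows. The stop-word list is a tiny hypothetical stand-in, and lowercasing and dropping non-letter characters are assumptions added for illustration. The removal of "video statements" is not shown because the source does not specify their form.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "and", "of", "to", "is", "in", "our", "for"}


def clean_text(raw):
    """Strip HTML tags, normalise to lowercase letters, drop stop words."""
    text = re.sub(r"<[^>]+>", " ", raw)            # remove HTML tags
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # keep letters only (assumption)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


print(clean_text("<p>Our <b>robot</b> is the best!</p>"))
```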

3.3.2 Text-mining model training

TF-IDF is a basic domain-independent text analysis approach that is widely used to evaluate the performance of text-mining benchmarks. It aids in enhanced feature selection by prioritising the most relevant/meaningful words in the text, while de-emphasising less impactful ones. This research builds a TF-IDF model using TfidfVectorizer and MultinomialNB functions. The former creates TF-IDF word vectors, while the latter builds a multinomial Naïve Bayes classification model to transform word vectors into numerical statistics. The Multinomial Naïve Bayes model is configured with alpha = 1.0, fit_prior = True, class_prior = None.
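The TF-IDF pipeline described above can be sketched with scikit-learn's `TfidfVectorizer` and `MultinomialNB`, using the parameter values quoted in the text. The example stories and labels are hypothetical, and chaining the two steps with `make_pipeline` is a convenience chosen here, not something the source specifies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned stories with Success labels (1 = funded, 0 = not funded).
stories = [
    "smart sensor prototype tested ready manufacturing",
    "working prototype shipped early backers weekly updates",
    "idea stage no prototype need funds start",
    "concept only unsure timeline need money",
]
labels = [1, 1, 0, 0]

model = make_pipeline(
    TfidfVectorizer(),  # text -> TF-IDF word vectors
    MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None),
)
model.fit(stories, labels)

new_story = ["tested prototype ready ship backers"]
prediction = model.predict(new_story)
probabilities = model.predict_proba(new_story)
print(prediction, probabilities)
```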

Word2Vec is a scalable text-mining technique that facilitates generalisation through deep learning methods to capture semantic meanings and preserve word relationships, which are essential for interpreting crowdfunding texts. Fig. 5 shows the proposed neural network architecture to enhance the Word2Vec model in this article. It uses the Adam optimiser and the categorical cross-entropy loss function. As the input layer is not bound to fixed-size batches, the model can be fed batches of arbitrary size. The following outlines the architecture of the neural network model:

1. An embedding layer to transform input values (i.e. vocabulary size is 3725) to fixed-size vectors of 100.
2. A 1D convolution layer with 32 output filters, a window size of 5 and "ReLU" activation function.
3. A Max-pooling layer for 1D temporal data (window size of 2).
4. A Flatten layer that flattens the input and connects the dense layers in the neural network.
5. A Dense layer with "ReLU" activation function to set the output dimension as 16.
6. A Dense layer with "Softmax" activation function to set the output dimension as 2.
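To make the layer stack above concrete, the following sketch walks through the output shape and trainable parameter count of each layer. It is plain Python arithmetic, not a trained model. The input sequence length of 200 tokens is hypothetical, and "valid" padding with stride 1 for the convolution and a pooling stride equal to the window size are assumptions, since the source does not state them.

```python
def cnn_summary(seq_len, vocab_size=3725, embed_dim=100, filters=32,
                kernel=5, pool=2, hidden=16, classes=2):
    """Return (layer name, output shape, trainable parameters) per layer."""
    layers = []
    # 1. Embedding: one embed_dim vector per vocabulary entry.
    layers.append(("Embedding", (seq_len, embed_dim), vocab_size * embed_dim))
    # 2. Conv1D, assuming 'valid' padding and stride 1.
    conv_len = seq_len - kernel + 1
    layers.append(("Conv1D", (conv_len, filters),
                   kernel * embed_dim * filters + filters))
    # 3. Max pooling, assuming stride equal to the window size.
    pool_len = conv_len // pool
    layers.append(("MaxPooling1D", (pool_len, filters), 0))
    # 4. Flatten to a single vector.
    flat = pool_len * filters
    layers.append(("Flatten", (flat,), 0))
    # 5. Dense + ReLU.
    layers.append(("Dense (ReLU)", (hidden,), flat * hidden + hidden))
    # 6. Dense + Softmax over the two classes.
    layers.append(("Dense (Softmax)", (classes,), hidden * classes + classes))
    return layers


summary = cnn_summary(seq_len=200)  # hypothetical padded story length
total_params = sum(params for _, _, params in summary)
for name, shape, params in summary:
    print(f"{name:16s} {str(shape):12s} {params}")
print("total parameters:", total_params)
```

Under these assumptions most of the parameters sit in the embedding layer, so the vocabulary size and embedding dimension dominate the model size.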

BERT is robust to noise and has the capacity to effectively manage text ambiguity and polysemy in complex and dynamic textual datasets such as Kickstarter. Its bidirectional architecture enables a deeper understanding of a word context by analysing both the preceding and following words in a sentence/text. A well-known pretrained BERT model, named multi_cased_L-12_H-768_A-12, is used in this research to analyse the textual dataset. This model has the capacity to analyse a variety of 104 languages [28].

4 Results

This section evaluates the performance of the numerical and textual analysis approaches to predict Success and Delivery. For this, five metrics including Accuracy, Precision, Recall, F1 score, and Training latency are measured and compared. They are introduced as follows:

1. Accuracy: the percentage of correct predictions out of the total number of predictions. It is the most commonly used measure to evaluate the performance of a model. Equation 2 calculates the model accuracy, where TP is the number of True Positive predictions (e.g., a delivered project correctly predicted as delivered), FP is the number of False Positive predictions (e.g., an undelivered project predicted as delivered), TN is the number of True Negative predictions (e.g., an undelivered project correctly predicted as undelivered), and FN is the number of False Negative predictions (e.g., a delivered project predicted as undelivered).
$$\begin{aligned} Accuracy = \frac{TP+TN}{TP+FP+TN+FN} \end{aligned}$$
(2)
2. Precision: the proportion of correctly predicted positive instances among all instances predicted as positive. It is calculated as:
$$\begin{aligned} Precision = \frac{TP}{TP+FP} \end{aligned}$$
(3)
3. Recall: the proportion of correctly predicted positive instances among all actual positive instances.
$$\begin{aligned} Recall = \frac{TP}{TP+FN} \end{aligned}$$
(4)
4. F1 score: the harmonic mean of Precision and Recall. It is usually used to compare the performance of classifiers. As Eq. 5 shows, the F1 score increases if the precision score and recall score increase.
$$\begin{aligned} F1 = \frac{2*precision*recall}{precision+recall} \end{aligned}$$
(5)
5. Training latency: the time taken to train the machine learning model. It is influenced by model complexity (i.e. training parameters) and the size of the training dataset [27]. The training latency was recorded for all the experiments conducted on a workstation configured with an Intel Core i7 processor, 16 GB of RAM, and an NVIDIA GeForce GTX 1650 GPU with 4 GB of GDDR5.
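As a worked example of Eqs. 2 to 5, the four metrics can be computed directly from confusion-matrix counts. The counts below are hypothetical and do not come from the paper's tables. Returning 0.0 when a denominator is zero is a convention chosen here.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, Precision, Recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for a Delivery classifier evaluated on 200 projects.
metrics = classification_metrics(tp=40, fp=10, tn=140, fn=10)
print(metrics)
```

With these counts, accuracy is 180/200 = 0.9, while precision, recall and F1 are all 40/50 = 0.8, which illustrates how accuracy can exceed the other metrics when negatives dominate the dataset.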

4.1 Numerical data analysis

This section evaluates the performance of the numerical machine learning models to predict Success and Delivery targets. For both targets, the number of possible combinations of methods, settings and hyper-parameters to explore is very large. Therefore, to make this task tractable, the analysis is broken down into two stages. Firstly, an initial comparison of the five classifiers is made with basic settings (e.g. linear kernel for SVM). Secondly, the most promising classifier of this set is fine-tuned by considering other settings and hyper-parameter values.

4.1.1 Success prediction

Table 4 compares the results of five trained classifiers including SVM, KNN, MLP, Random Forest and Boost Tree. As it shows, the trained (linear) SVM outperforms the benchmarks in terms of Accuracy, Precision, Recall, and F1 score.

Table 4 The results of 5-classifications in success prediction


4.1.2 SVM tuning and optimisation for success prediction

The SVM model gives the best performance as compared to the other benchmarks. However, the model performance is improved if the kernel function and regularisation parameters (i.e. degree and penalty parameter) are tuned. The hyper-parameters are introduced as below:

The performance of the SVM model is evaluated based on four kernel functions: Linear, Polynomial, Radial Basis Function (RBF), and Sigmoid. The Linear kernel only allows for a linear relationship between features and outcome, whereas the Polynomial kernel allows for non-linear model training. The RBF kernel function (also called the Gaussian kernel) is often used to support complex datasets with no previous knowledge. The Sigmoid kernel works similarly to a two-layer perceptron neural network model.
The Degree refers to the degree of the Polynomial kernel, which controls the flexibility of the decision boundary. The Degree value varies as 1, 2, and 3 in this experiment.
C is the penalty parameter that is used to control the error rate in SVM. The performance of the trained SVM is tested by C values of 0.01, 1.0, 5.0 and 10 to obtain the best results.
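The four kernel functions can be sketched in plain Python as follows. This is a minimal illustration using the textbook definitions; the parameter names `gamma` and `coef0` follow common SVM library conventions and are assumptions, since the paper does not report which values were used.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def dot(x: Vector, z: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x: Vector, z: Vector) -> float:
    return dot(x, z)


def polynomial_kernel(x: Vector, z: Vector, degree: int = 3,
                      gamma: float = 1.0, coef0: float = 0.0) -> float:
    # (gamma * <x, z> + coef0) ** degree
    return (gamma * dot(x, z) + coef0) ** degree


def rbf_kernel(x: Vector, z: Vector, gamma: float = 1.0) -> float:
    # exp(-gamma * ||x - z||^2)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x: Vector, z: Vector, gamma: float = 1.0,
                   coef0: float = 0.0) -> float:
    # tanh(gamma * <x, z> + coef0)
    return math.tanh(gamma * dot(x, z) + coef0)


x, z = [1.0, 2.0], [3.0, 0.5]
print(linear_kernel(x, z))                 # inner product: 1*3 + 2*0.5
print(polynomial_kernel(x, z, degree=1))   # same value when gamma=1, coef0=0
print(rbf_kernel(x, z), sigmoid_kernel(x, z))
```

With `gamma=1` and `coef0=0`, a Polynomial kernel of degree 1 reduces to the Linear kernel, so a degree-1 Polynomial-SVM still produces a linear decision boundary.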

As Table 5 shows, the Polynomial-SVM gives the best performance for Success prediction among the kernels. The Sigmoid kernel performs worst and is also very slow to train. This is because an SVM with the Sigmoid kernel behaves like a two-layer perceptron neural network, which here exhibits high latency and low accuracy.

Table 5 SVM’s kernel tuning for success prediction


Table 6 shows the results of the Polynomial-SVM as the Degree changes. Training latency increases and accuracy decreases as the Degree value increases, which suggests that the dataset is close to linearly separable.

Table 6 Polynomial-SVM’s degree tuning for success prediction


Table 7 demonstrates the impact of the penalty parameter (C) on the Polynomial-SVM model with a degree of 1. The model is evaluated with four C values: 0.01, 1.0, 5.0, and 10.0. As the results show, the Polynomial-SVM model with degree 1 and a penalty parameter of 5.0 gives the best performance.
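The stage-by-stage tuning described in this subsection (kernel first, then Degree, then C) can be sketched as a simple sequential search. This is a hedged illustration rather than the authors' code: `staged_search`, the starting defaults, and `toy_score` are hypothetical, and the toy scores are made-up numbers chosen only so that the loop ends at the configuration reported for Table 7.

```python
from typing import Any, Callable, Dict, List, Tuple

Config = Dict[str, Any]

KERNELS = ["linear", "poly", "rbf", "sigmoid"]
DEGREES = [1, 2, 3]
C_VALUES = [0.01, 1.0, 5.0, 10.0]


def staged_search(score: Callable[[Config], float]) -> Tuple[Config, int]:
    """Tune kernel, then degree, then C, fixing each winner before moving on."""
    # Assumed starting defaults; the paper does not state them.
    best: Config = {"kernel": KERNELS[0], "degree": DEGREES[0], "C": 1.0}
    evaluations = 0
    stages: List[Tuple[str, list]] = [
        ("kernel", KERNELS), ("degree", DEGREES), ("C", C_VALUES)
    ]
    for name, candidates in stages:
        scored = []
        for value in candidates:
            trial = {**best, name: value}  # vary one setting, keep the rest
            scored.append((score(trial), value))
            evaluations += 1
        best[name] = max(scored, key=lambda pair: pair[0])[1]
    return best, evaluations


def toy_score(cfg: Config) -> float:
    # Made-up scores purely to exercise the loop; NOT results from the paper.
    kernel_bonus = {"linear": 0.2, "poly": 0.3, "rbf": 0.1, "sigmoid": 0.0}
    return (kernel_bonus[cfg["kernel"]]
            - 0.05 * cfg["degree"]
            - 0.01 * abs(cfg["C"] - 5.0))


best, evaluations = staged_search(toy_score)
full_grid = len(KERNELS) * len(DEGREES) * len(C_VALUES)
print(best, evaluations, full_grid)  # 11 staged evaluations vs 48 for a full grid
```

The staged search needs 4 + 3 + 4 = 11 model fits instead of the 48 required by an exhaustive grid, at the cost of possibly missing interactions between settings.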

Table 7 Polynomial-SVM’s C tuning for success prediction


4.1.3 Delivery prediction

According to Table 8, the linear SVM slightly outperforms MLP and Boosted Tree for Delivery prediction.

Table 8 The results of 5-classifications in delivery prediction


4.1.4 SVM tuning and optimisation for delivery prediction

This section aims to improve the performance of the trained SVM by tuning the kernel function, Degree, and C parameter. According to Table 9, the Linear, RBF and Polynomial kernel functions give similar results. However, RBF-SVM outperforms the Linear kernel because it enlarges the function space and so offers more flexibility on non-linear datasets. In addition, RBF-SVM trains faster than Polynomial-SVM, because evaluating the inner products of feature vectors in the Polynomial kernel increases the training delay.

Table 9 SVM’s kernel tuning for delivery prediction


Table 10 RBF-SVM’s degree tuning for delivery prediction


The RBF-SVM is tuned with Degree values of 1, 2, and 3. As Table 10 shows, the RBF-SVM with a Degree of 1 slightly outperforms the others. Indeed, the kernel function becomes more complicated and slower as the Degree value increases.

The performance of the RBF-SVM with a Degree of 1 is then evaluated as the C value changes. As Table 11 shows, the model gives the best performance when C is 1.0. This is because the Delivery dataset is not linearly separable: more outliers are misclassified if C drops below 1.0, while prediction performance worsens when C is increased above 1.0.
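The role of C can be made explicit with the standard soft-margin SVM objective. The formulation below is the textbook one, not taken from the paper; the symbols $w$, $b$, $\xi_i$ and the feature map $\phi$ are introduced here for illustration.

```latex
\min_{w,\,b,\,\xi}\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,
\qquad \xi_i \ge 0,\quad i = 1,\dots,n .
```

A small C makes the slack variables $\xi_i$ cheap, so more training points are allowed to violate the margin or be misclassified. A large C penalises violations heavily, which narrows the margin and can reduce performance on unseen data.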

Table 11 RBF-SVM’s penalty parameter tuning for delivery prediction


4.2 Text-mining analysis

This section uses text-mining techniques to analyse the textual datasets, Story and Story&Risk, to predict Success and Delivery. Figure 6 shows the top 50 keywords for both the Story and Story&Risk datasets in Success prediction, while Fig. 7 shows the word cloud in Delivery prediction.

Three well-known text-mining approaches, BERT, TF-IDF and Word2vec, are used to analyse the project stories. The TF-IDF and Word2vec models split the datasets into train (75%) and test (25%) parts. BERT, however, partitions the datasets into three parts: train, dev and test. It splits the training dataset into two parts at a ratio of 3 to 1. The train part is used to train the model, dev is used to assess the hyper-parameters and evaluate the model during development, and test is left for independent testing [11].
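The data partitioning can be sketched as two successive random splits. This is a minimal illustration under stated assumptions: it assumes that BERT starts from the same 75%/25% train/test split as the other models before carving out dev, which the text does not state explicitly, and the helper `split` and the seeds are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split(items: Sequence[T], first_fraction: float,
          seed: int) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of items and cut it into two parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * first_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(1000))  # stand-in for project identifiers

# TF-IDF / Word2vec style: 75% train, 25% test.
train_full, test = split(records, 0.75, seed=0)

# BERT style (assumed): split the training part again at a ratio of 3 to 1.
train, dev = split(train_full, 0.75, seed=1)

print(len(train), len(dev), len(test))  # 562 188 250
```

Keeping the test part untouched during hyper-parameter selection on dev is what allows it to serve as an independent estimate of performance.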

Table 12 shows the textual analysis results for Success prediction. As it shows, BERT slightly outperforms TF-IDF and Word2vec if the Story&Risk feature is used, while Word2vec outperforms the others if Story is analysed. This suggests that Word2vec performs better if a single, high-quality feature (i.e. Story) is used for prediction. On the other hand, BERT outperforms Word2vec when a complex and non-linear feature (i.e. Story&Risk) is used. According to the results, BERT is very slow for this predictive analysis because it is built on a multilayer bidirectional transformer and requires fine-tuning [29].

Table 12 Textual analysis for “success” prediction


Table 13 shows the results of the text-mining techniques for Delivery prediction. BERT outperforms the other methods if the Story&Risk feature is used, whereas Word2vec gives the best results if Story is analysed.

Table 13 Textual analysis for “delivery” prediction


In both Tables 12 and 13, accuracy is higher than the F1 score due to the imbalance in the Success and Delivery class labels. Class imbalance leads to inflated accuracy values because the model tends to predict the majority class. The F1 score provides a more balanced assessment of performance as it is less influenced by class imbalance. Refining the proposed data preprocessing approach by incorporating additional techniques to address the class imbalance remains a topic for future work, as outlined in Sect. 6.
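The gap between accuracy and F1 under class imbalance can be seen in a small worked example. The confusion-matrix counts below are invented for illustration (10 minority-class and 90 majority-class samples) and are not taken from the paper's tables; `classification_metrics` is a hypothetical helper.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative counts: the classifier gets most of the majority class right
# (tn=85) but finds only half of the minority class (tp=5, fn=5).
metrics = classification_metrics(tp=5, fp=5, fn=5, tn=85)
print(metrics)  # accuracy is 0.9 while F1 is only 0.5
```

Because accuracy counts the many correct majority-class predictions, it stays high even though half of the minority class is missed, whereas F1 depends only on precision and recall for the positive class.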

5 Discussion

The results give rise to several implications. Here, comparisons with other studies and theoretical contributions are discussed, along with implications for project creators and backers and future directions and limitations.

5.1 Comparison of results

This study aimed to explore three research questions (listed in Sect. 1) and these are now reviewed, in contrast to previous studies, as follows.

RQ1- As the results show, the proposed data preprocessing approach improves the performance of both numerical and textual crowdfunding data analysis applications.
RQ2- This study proposes an extensive data-driven evaluation approach in which five numerical machine learning techniques and three textual machine learning models are evaluated and compared to determine the best-fitting approach for machine learning-enabled crowdfunding applications. For Success prediction, this study achieves a higher accuracy (95.76%) than the SVM (92%) [43] and MLP (93%) [58] models used in other studies. This shows that the trained SVM is well suited to classification on non-linear datasets with numerous features. Since SVM and MLP perform better here than the same techniques applied to the same problem in [43] and [58], the uplift is likely due to the additional preprocessing steps, including feature selection, introduced in this study. Fewer research articles study the prediction of Delivery, which limits comparison with the results of this study. Tran et al. [50] use XGBoost to model on-time Delivery and achieve 82.5% accuracy. In contrast, this study achieved 90.20% accuracy with the Gradient Boosting Classifier, which suggests that some of the uplift is due to the additional preprocessing steps introduced in this study, including feature selection.
RQ3- This study has presented an experimental set-up to analyse several text-mining methods and has reported the results of applying them. The results show that BERT and Word2vec improve marginally on a simple TF-IDF approach. Since Nam et al. [33] also use TF-IDF on Kickstarter data, their results can be compared with those reported here for the same method. Table 12 shows that TF-IDF with a Naïve Bayes classifier in this study achieves 77.18% and 77.86% for Success prediction, whereas [33] achieves 58.13% in the same Naïve Bayes setting. This suggests that much of the improvement in performance is due to the preprocessing steps introduced in this study.

5.2 Theoretical contributions

Several contributions to innovation research emerge from this analysis. Firstly, the numerical data models and text-mining models with the best predictive power have been identified. These provide important value creation mechanisms for creators and backers to predict fundraising success and project delivery under different circumstances.

Secondly, the models are constructed to predict product delivery performance in addition to fundraising success, using higher-dimensional data, and are able to detect interactions and nonlinearities in the complex business world. This helps generate new theoretical and empirical insights for the strand of studies on the mechanisms through which financial technology influences business and innovation.

Thirdly, the methodologies introduced in this study have advantages over both parametric regressions and statistical learning techniques that rely on strong modelling assumptions: they are subject to less uncertainty and help to reduce researchers’ discretion in traditional business research. Five numerical data models and three text-mining models are built, tested and compared in real time. The numerical and text-mining models take into consideration the 104 numerical features and the story and risk descriptions of the crowdfunding projects, avoiding the bias caused by feature selection and weight control in linear regression models.

5.3 The research limitations

This study is limited by data availability and generalisability. Future research could include social media data to train the numerical and textual models, instead of relying exclusively on Kickstarter data. Social media plays a key role in crowdfunding as project creators often use these platforms to advertise their projects and interact with potential backers. Using social data (e.g., follower interactions and engagement patterns), the data analysis could yield richer insights and improved accuracy. In addition, the generalisability of this study needs to be enhanced to cover a broader range of crowdfunding scenarios and applications. For this, a more robust and comprehensive dataset could be created by using additional samples/features (e.g. user reviews) which are absent from Kickstarter’s Story and Risk data but available in other crowdfunding datasets.

6 Conclusion and future directions

This research proposes an extensive data-driven numerical and textual predictive analysis to evaluate the status of reward-based crowdfunding projects in terms of Success and Delivery. It helps project creators and backers estimate a project’s Success and Delivery in advance. This article uses a well-known dataset with 24,100 records and 103 features, crawled from the Design&Technology category on the Kickstarter online platform. The dataset is preprocessed to handle noise and missing values and to select features. It is divided into numerical and textual parts: the former is used to train five machine learning techniques, namely Support Vector Machine (SVM), K-Nearest Neighbour (KNN), Multilayer Perceptron Classifier (MLP), Random Forest, and Gradient Boosting Classifier, whereas the latter is used by three well-known text-mining techniques (TF-IDF, Word2Vec, and BERT) to find the best predictive models.

The experiments show that the proposed data preprocessing and numerical-textual model training are able to improve Success and Delivery predictions in crowdfunding applications. As the numerical results show, the trained SVM model outperforms the existing models and research benchmarks. Polynomial-SVM has the best performance for a linear prediction (Success), while RBF-SVM gives the best results for a non-linear prediction (Delivery). The textual analysis shows that BERT is a slow method that fits complex text features. In contrast, Word2vec works better if the dataset is simple and high-quality. These findings are consistent with the literature, but with improved results.

This study only focuses on reward crowdfunding, but it also demonstrates that the numerical models and text-mining models can be adapted to other financing channels with numerical features and text descriptions. These settings include other crowdfunding platforms (like patronage crowdfunding, equity crowdfunding and debt crowdfunding), bonds, IPOs (initial public offerings), rights offerings and additional share issues. The numerical features available along with the text descriptions, such as prospectuses and listing memoranda, may be used to construct machine learning models and better predict financial or economic performance. Future researchers are encouraged to investigate the possibilities of developing theoretical and empirical studies on these issues.

A multi-target prediction approach should be used to extend the numerical analysis in the future. It would allow the project’s stakeholders to analyse the status of crowdfunding projects against several targets, mainly Success and Delivery, at the same time. However, it may suffer from non-linear correlations between the targets, especially when the numerical dataset is big and complex. Output Kernel Learning (OKL) or multilayer prediction techniques could be studied and used to address this issue. Additionally, Deep Learning methods could also be explored for this problem as the base machine learning approach.

The impact of the preprocessing method on the performance of the machine learning models needs further study, especially where data preprocessing optimisation is an objective. Additionally, the current preprocessing approach requires additional refinement, particularly through the integration of algorithms such as Levenshtein Distance and Phonetic matching. These techniques can help to automatically identify and correct misspellings and informal language, which are common in user-generated content like Kickstarter datasets. Furthermore, there is a clear need to explore resampling and data augmentation strategies to address the class imbalance (i.e. in the Success and Delivery labels) in such datasets and enhance model robustness.
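As an illustration of the Levenshtein Distance idea mentioned above, the sketch below computes the edit distance between two strings and uses it to map a misspelt token to its closest vocabulary word. It is a generic textbook implementation, not part of the study's pipeline; `closest_word` and the example vocabulary are hypothetical.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_word(token: str, vocabulary: Iterable[str]) -> str:
    """Return the vocabulary word with the smallest edit distance to token."""
    return min(vocabulary, key=lambda word: levenshtein(token, word))


print(levenshtein("kitten", "sitting"))  # 3
print(closest_word("protoype", ["prototype", "project", "product"]))  # prototype
```

A practical spelling corrector would also cap the allowed distance so that rare but valid words are not replaced by unrelated vocabulary entries.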

Availability of data and materials

All data/materials are available from the corresponding author on request.

References

Appio FP, Leone D, Platania F, Schiavone F (2020) Why are rewards not delivered on time in rewards-based crowdfunding campaigns? An empirical exploration. Technol Forecast Soc Change 157:120069
Ardakani SP, Zhou C, Wu X, Ma Y, Che J (2021) A data-driven affective text classification analysis. In: 20th IEEE International Conference on Machine Learning and Applications (ICMLA), Virtually Online, 13–15 Dec 2021
Benesty J, Chen J, Huang Y, Cohen I (2009) Pearson correlation coefficient. In: Noise Reduction in Speech Processing. Springer, pp 1–4
Brem A, Bilgram V, Marchuk A (2019) How crowdfunding platforms change the nature of user innovation-from problem solving to entrepreneurship. Technol Forecast Soc Change 144:348–360
Burtch G, Hong Y, Liu D (2018) The role of provision points in online crowdfunding. J Manag Inf Syst 35(1):117–144
Butticè V, Ughetto E (2021) What, where, who, and how? A bibliometric study of crowdfunding research. IEEE Trans Eng Manag. https://doi.org/10.1109/TEM.2020.3040902
Calic G, Shevchenko A (2020) How signal intensity of behavioral orientations affects crowdfunding performance: the role of entrepreneurial orientation in crowdfunding business ventures. J Bus Res 115:204–220
Cappa F, Franco S, Ferrucci E, Maiolini R (2021) The impact of product and reward types in reward-based crowdfunding. IEEE Trans Eng Manag. https://doi.org/10.1109/TEM.2021.3058309
Cheng C, Tan F, Hou X, Wei Z (2019) Success prediction on crowdfunding with multimodal deep learning. In: IJCAI, pp 2158–2164
Design & tech - kickstarter. https://www.kickstarter.com/design-tech?ref=section-homepage-nav-click-design-tech
Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
Ellman M, Hurkens S (2019) Optimal crowdfunding design. J Econ Theory 184:104939
Facets—visualizations for ML datasets. https://pair-code.github.io/facets/. Accessed Aug 2021
Feng W, Ma C, Zhao G, Zhang R (2020) FSRF: an improved random forest for classification. In: 2020 IEEE International Conference on Advances in Electrical Engineering and Computer Applications (AEECA). IEEE, pp 173–178
Gafni H, Marom D, Sade O (2019) Are the life and death of an early-stage venture indeed in the power of the tongue? Lessons from online crowdfunding pitches. Strateg Entrep J 13(1):3–23
Guo Y, Zhou X, Zhan C, Zeng Y, Zhong L (2020) Prediction and analysis of success on crowdfunding projects. In: Proceedings of the 2020 4th International Conference on Electronic Information Technology and Computer Engineering, pp 785–789
Hua X, Huang Y, Zheng Y (2019) Current practices, new insights, and emerging trends of financial technologies. Ind Manag Data Syst 119:1401–1410
Indiegogo. https://www.indiegogo.com/. Accessed Sept 2021
Jhaveri S, Khedkar I, Kantharia Y, Jaswal S (2019) Success prediction using random forest, catboost, xgboost and adaboost for kickstarter campaigns. In: 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC). IEEE, pp 1170–1173
Kamath RS, Kamat RK (2016) Supervised learning model for kickstarter campaigns with R mining. Int J Inf Technol Model Comput. https://doi.org/10.2139/ssrn.3513341
Kathuria RS, Gautam S, Singh A, Khatri S, Yadav N (2019) Real time sentiment analysis on twitter data using deep learning (keras). In: 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS). IEEE, pp 69–73
Kickstarter. What are the basics? [EB/OL]. https://help.kickstarter.com/hc/en-us/articles/115005028514-What-are-the-basics-. Accessed 26 Sept 2021
Koch J-A, Siering M (2015) Crowdfunding success factors: the characteristics of successfully funded projects on crowdfunding platforms
Li X et al (2013) Using random forest for classification and regression. Chin J Appl Entomol 50(4):1190–1197
Lin C, Miller T, Dligach D, Plenge RM, Karlson EW, Savova G (2012) Maximal information coefficient for feature selection for clinical document classification. In: ICML Workshop on Machine Learning for Clinical Data. Edinburgh, UK
Liu T, Gong X, Liu Z, Ma C (2021) Direct and configurational paths of capital signals to technology crowdfunding fundraising. IEEE Trans Eng Manag 34(1):30–44. https://doi.org/10.1109/TEM.2021.3068524
Maeda T (2018) How to rationally compare the performances of different machine learning models? Technical report, PeerJ Preprints
Maharani W (2020) Sentiment analysis during Jakarta flood for emergency responses and situational awareness in disaster management using BERT. In: 2020 8th International Conference on Information and Communication Technology (ICoICT). IEEE, pp 1–5
Mayfield E, Black AW (2020) Should you fine-tune BERT for automated essay scoring? In: Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp 151–162
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp 3111–3119
Mollick E (2014) The dynamics of crowdfunding: An exploratory study. J Bus Ventur 29(1):1–16. https://doi.org/10.1016/j.jbusvent.2013.06.005
Murty MN, Raghava R (2016) Kernel-based SVM. In: Support Vector Machines and Perceptrons. Springer, pp 57–67
Nam S, Jin Y, Kwon O (2018) Online document mining approach to predicting crowdfunding success. J Intell Inf Syst 24(3):45–66
Natekin A, Knoll A (2013) Gradient boosting machines, a tutorial. Front Neurorobot 7:21
Patreon. https://www.patreon.com/. Accessed Sept 2021
Pati R, Garud N (2021) Social interaction and crowdfunding project success: moderating roles of product development stage and product innovativeness. IEEE Trans Eng Manag. https://doi.org/10.1109/TEM.2021.3061532
Patle A, Chouhan DS (2013) SVM kernel functions for classification. In: 2013 International Conference on Advances in Technology and Engineering (ICATE). IEEE, pp 1–9
Peterson LE (2009) K-nearest neighbor. Scholarpedia 4(2):1883
Provost F, Fawcett T (2013) Data science for business. O’Reilly, Sebastopol
Putrì I, Septiana IS, Mahendra R et al (2017) Estimating the collected funding amount of the social project campaigns in a crowdfunding platform. In: 2017 International Conference on Advanced Computer Science and Information Systems (ICACSIS). IEEE, pp 277–282
Qaiser S, Ali R (2018) Text mining: use of tf-idf to examine the relevance of words to documents. Int J Comput Appl 181(1):25–29
Ren J, Raghupathi V, Raghupathi W (2021) Exploring the subjective nature of crowdfunding decisions. J Bus Ventur Insights 15:e00233
Sawhney K, Tran C, Tuason R (2016) Using language to predict kickstarter success
Sciandra A (2020) COVID-19 outbreak through tweeters’ words: monitoring Italian social media communication about COVID-19 with text mining and word embeddings. In: 2020 IEEE Symposium on Computers and Communications (ISCC). IEEE, pp 1–6
Song C, Luo J, Hölttä-Otto K, Seering W, Otto K (2020) Crowdfunding for design innovation: prediction model with critical factors. IEEE Trans Eng Manag. https://doi.org/10.1109/TEM.2020.3001764
Statista (2021) Market size of crowdfunding worldwide in 2020 with a forecast for 2027. [EB/OL], a. https://www.statista.com/statistics/1078273/global-crowdfunding-market-size/. Accessed 26 Sept 2021
Statista (2021) Cumulative amount of funding pledged to kickstarter projects as of July 2021. [EB/OL], b. https://www.statista.com/statistics/310218/total-kickstarter-funding/. Accessed 26 Sept 2021
Strausz R (2017) A theory of crowdfunding: a mechanism design approach with demand uncertainty and moral hazard. Am Econ Rev 107(6):1430–1476
Tian J (2021) Do you want to foresee your future? The best model predicting the success of kickstarter campaigns. In: 2021 13th International Conference on Machine Learning and Computing, pp 223–231
Tran T, Lee K (2017) Characteristics of on-time and late reward delivery projects. In: Proceedings of the International AAAI Conference on Web and Social Media, vol 11, no 1, pp 676–679. https://ojs.aaai.org/index.php/ICWSM/article/view/14965
Tran T, Lee K, Vo N, Choi H (2017) Identifying on-time reward delivery projects with estimating delivery duration on kickstarter. In: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017, pp 250–257
Tuo G, Feng Y, Sarpong S (2019) A configurational model of reward-based crowdfunding project characteristics and operational approaches to delivery performance. Decis Support Syst 120:60–71
Wang W, Zheng H, Wu J (2020) Prediction of fundraising outcomes for crowdfunding projects based on deep learning: a multimodel comparative study. Soft Comput 24(11):8323–8341
Wang Y, Yi F, Hu J (2019) The determinants of reward-based crowdfunding project delivery performance: a configurational model based on latent Dirichlet allocation. In: IOP Conference Series: Materials Science and Engineering, vol 688. IOP Publishing, pp 055073
Wehnert P, Beckmann M (2022) Crowdfunding for a sustainable future: a systematic literature review. IEEE Trans Eng Manag. https://doi.org/10.1109/tem.2021.3066305
Xu L, Wu Q, Du P, Qiao X, Tsai S-B, Li D (2018) Financing target and resale pricing in reward-based crowdfunding. Sustainability 10(4):1297
Yilmaz I, Kaynar O (2011) Multiple regression, ANN (RBF, MLP) and ANFIS models for prediction of swell potential of clayey soils. Expert Syst Appl 38(5):5958–5966. https://doi.org/10.1016/j.eswa.2010.11.027
Yu P-F, Huang F-M, Yang C, Liu Y-H, Li Z-Y, Tsai C-H (2018) Prediction of crowdfunding project success with deep learning. In: 2018 IEEE 15th International Conference on E-Business Engineering (ICEBE). IEEE, pp 1–8
Yuan H, Lau RYK, Xu W (2016) The determinants of crowdfunding success: a semantic text analytics approach. Decis Support Syst 91:67–76
Zheng H, Xu B, Wang T, Chen D (2017) Project implementation success in reward-based crowdfunding: an empirical study. Int J Electron Commer 21(3):424–448. https://doi.org/10.1080/10864415.2016.1319233


Acknowledgements

Not applicable.

Funding

Not applicable.

Author information

Authors and Affiliations

School of Engineering and Physical Sciences, University of Lincoln, Lincoln, LN6 7TS, UK
Saeid Pourroostaei Ardakani
Lingnan College, Sun Yat-sen University, Guangzhou, 510275, China
Jianwei Hu
Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, 15213–3890, USA
Jing Zhang
Department of Computer Science and Engineering, University of California San Diego, California, CA, 92093, USA
Kaifeng Jin
School of Computer Science, University of Nottingham Ningbo China, Ningbo, 315100, China
Tianhong Cai & Anthony Graham Bellotti
Nottingham University Business School China, University of Nottingham Ningbo China, Ningbo, 315100, China
Xiuping Hua


Contributions

The authors contributed to each part of this paper equally. The authors read and approved the final manuscript.

Corresponding author

Correspondence to Saeid Pourroostaei Ardakani.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval and consent to participate

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: story and risk textual data samples

Project Name	Story	Risk
Portfolio - A better NHS portfolio	...We are creating a new portfolio - one that’s guided by the needs of the you, one that’s simple and enjoyable to use, that feels modern and fresh.	The biggest challenge to oPortfolio is adoption by deaneries and colleges.
“Tailly”, the tail that wags when you get excited	We have built a prototype of the moving tail, “Tailly”, which uses sensors on the inside of the belt to measure the user’s heart rate.	There are always risks involved in designing and manufacturing new products. We would like to take on the challenge of overcoming these risks, and we feel that there are two major challenges with “Tailly”:1) Measuring heart rate...2) The strength of the product.
PiCE, The Ultimate Case For Your Raspberry Pi & Camera	Hi, we’re Elson Designs and this is PiCE, the ultimate case for your Raspberry Pi.PiCE isn’t cute, it’s not sweet. It’s rugged, industrial design keeps your Raspberry Pi safe, even from water! Yet, PiCE still allows you to use your Pi and the Pi Camera as the Raspberry Pi foundation intended. You can even use it outside!.	We have a few main risks to the project, such as: Delays in the tooling and initial batches...Another risk we have would be defects in the design.
Learn Concrete5	Concrete5 is the fastest growing Content Management System (CMS) in the world right now. It is a great open source platform for site management and blogging.	I have been in the industry for over 22 years. I am an active member of the Concrete5 community and we use this system each and every day with most websites produced using Concrete5 exclusively. I see no delays or challenges to what we propose. The risks are minimal.
Schema Migrations for Django	...Schema migration with Django has had a long and complex history, but for the last few years South has become the go-to choice. Now, with South’s four-year-old design hitting serious limits, it’s time to add migration support into Django itself.	Software development always carries a few risks - in particular, unexpected edge cases and design flaws - but in this case my previous experience writing South and the year or so of planning and discussion means that most of the kinks have been worked out by now. There’s also the risk of the project not making it into a release of Django in a timely fashion, but as long as it proceeds relatively on-time and gets enough code review it should make it. If the funding goes above the set limit, I’ll use some of it to help compensate any code reviewers for their time.
OSCNC - Open source CNC Machine using Mach3 / Linux CNC	...The goal is to produce a complete CNC machine based on one of our existing CNC machines allowing a cheap but reliable machine to be built.	There are a few risks associated with the project that may set back the delivery time, however we are a dedicated team who have been working on the similar projects for nearly ten years and have dealt with many hiccups along the way. With our new Kickstarter backers joining the team we hope that nothing could stand in our way.
Joomla Web Services	Joomla needs a RESTful web services API.It needs it badly because there is ever-increasing demand for content to be consumed across platforms and across devices.	The biggest risk is that the code is not stable enough to be included in the 3.2 release in September. Whilst disappointing it wouldn’t be the end of the world as it would simply be carried forward to the 3.5 release in March. The code will be available to download from the GitHub repository at any time anyway.
Observium Alerting	Observium is an Open Source, auto-discovering network monitoring platform written in PHP which supports a wide range of devices and operating systems.	The primary risk is that we don’t manage to fully implement the alerting system within the time afforded to us by the funding. Even in this situation, nothing will be wasted, any development work we’ve done will get us ever closer to having a finished, usable alerting system. We’ll get there, it might just take a little longer!
Trax: Next generation mini GPS tracker for Children and Pets	...Trax is a tiny smart personal GPS tracker that through an App can be located nearly anytime, from almost anywhere. Trax is easy to use, water proof and durable. It has a built in sensor that measures speed and direction, and provides the positions when the GPS signal is lost.	Our biggest challenges lie in the developing part of the project...Another potential risk lies within manufacturing and assembly.
LinuxonAndroid - To the Next Level!	LinuxonAndroid is a ongoing project run by a University student with a passion to get full blown Linux distros (Ubuntu, debian etc etc) running on top of Android devices!This opens a whole new world of possibilities to your Android device:Ever wanted to run a web server from a phone?.	The largest problems we will face are in making LinuxonAndroid as smooth as possible. As we all have bills to pay and roofs to keep over our heads the larger a fund we can raise the more resources we can commit to the project and the more likely features can be completed.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Pourroostaei Ardakani, S., Hu, J., Zhang, J. et al. Identifying crowdfunding storytellers who deliver successful projects: a machine learning approach. J Supercomput 81, 263 (2025). https://doi.org/10.1007/s11227-024-06785-4

Accepted: 27 November 2024
Published: 09 December 2024
DOI: https://doi.org/10.1007/s11227-024-06785-4
