
1 Introduction

With the pervasive adoption of smartphones and the ubiquitous availability of broadband, many companies now engage with their customers solely through digital channels. This rapid adoption of app-powered businesses led Newsweek to coin the term App Economy in 2010 [1], referring to a growing digital economy.

Many of these businesses operate a freemium model: a tiered approach where the base application is free of charge whilst additional premium services are offered via in-app purchases. This business model has been operating for a number of years and generates substantial user data. It is timely to review the data generated by such a business and to establish whether data mining and machine learning techniques can be applied to generate new insights into customer behavior, thereby enabling targeted customer marketing and incentivisation.

For this paper, we partnered with an industry partner: an App Economy company with a freemium investment app which, for reasons of confidentiality and data protection, must remain nameless. Their mission is to get the world investing successfully, for which they provide two apps, which we will refer to as “Sandbox” and “Real”. The former helps users understand the fundamentals of investing in the absence of investor risk, and the latter is a showroom of stocks hand-picked and recommended by the Chief Investment Officer, together with an easy facility to track and invest in stocks.

This paper seeks specifically to understand which app usage behaviors in the Sandbox app correspond to (or indicate) subscribers transitioning to the Real app stock advisory service. Aligned with this, a further aim is to understand which customers in the non-paying cohort are most likely to subscribe to the premium Real service. This is a classification problem well suited to machine learning methods. App usage data is mined to identify the leading app usage indicators that a user will subscribe, noting the significance of each indicator. Whilst there is some research literature in this area, there is no research on how these classification algorithms can be applied to a freemium app offering consumer investment advisory services.

Addressing this objective presents a number of challenges: (1) Apps have many users and vast amounts of data, but not much information. To remove barriers to adoption, no user profile or usage data is captured for the Sandbox app and limited user profile data is captured for the Real app. Hence, there is very limited demographic data on the user base; the primary data available is the app usage event data. (2) Data quality: a number of fields in the user profile data have missing or inconsistent values. This is a common issue in freemium apps, where users often cloak or misrepresent their identity. The central challenge here is building accurate user models and demographics. The underlying app usage event data is considered to be more reliable. (3) High volumes of usage data: 10 million rows of event data. A user can generate one of over 200 events (e.g., saw stock details, favorited a stock). In isolation, each event record is meaningless. However, the vast array of data can be used to engineer meaningful features, which can provide insight into user app engagement and service adoption patterns. (4) Data imbalance: the Real service for which we want to predict adoption has only been live since December 2016, whilst the app in general has been available for circa three years. Hence, there is naturally an imbalance in the data.

The Related Work section which follows describes other related research in this field, including how these challenges can be overcome. The insights from this research heavily influenced the approach and plan for this project. In some cases, gaps in the research are identified, some of which shaped the approach taken. These are set out in the Methodology section, which also details how the challenges are overcome. The Implementation section provides more detail on the technical implementation of the project. An evaluation of the results is then provided, highlighting the algorithms which perform best in this setting. The paper closes with conclusions and recommendations for future work in this field. Our industry partner will adopt the data mining and machine learning capability that was built for this research project and plans to address a number of the future work recommendations.

2 Related Work

Limited research has been conducted on using data mining and machine learning to support transitioning users from freemium to premium in apps. Other challenges such as feature engineering and managing imbalanced datasets are also reviewed, to help inform the data mining approach for this paper and to identify areas where research can be progressed through this work.

2.1 Freemium to Premium

Moving users from freemium to premium is a key business objective, necessary for many companies to grow and sustain growth. Yet, there is limited research in this field. Sifa et al. [2] conducted research on predicting in-app purchase decisions in mobile free-to-play games. Similar to our objective, they needed to manage class imbalance given that most users of these games are non-paying. Using the Synthetic Minority Over-Sampling Technique-Nominal Continuous approach, they noted that Random Forest out-performed SVMs and Decision Trees with better recall (0.439), precision (0.643) and F-score (0.522). Unsurprisingly, models that ran with more data (e.g., 7 days of observations rather than 3) produced better results. Key predictor variables of future purchases were noted as: number of previous purchases, amount spent previously, number of interactions with other players, levels progressed in the game, device, and playtime. The models provide insights into how to optimize marketing spend to a user’s profile. Furthermore, the models provide valuable information for the design of the product itself (such as the number of levels and other game-specific features). Whilst the model included many pertinent features such as user engagement and frequency of engagement, it did not include any recency measures.

2.2 Feature Engineering

As there are no (reliable) demographic attributes of note that could be used, feature engineering of the underlying event data is critical to derive relevant input variables describing user app usage behavior. Domingos [3] notes that the features used are the most important factor determining the success or failure of machine learning projects. He notes that the raw data is often not in a format that can be easily consumed by machine learning algorithms and that a very significant amount of work is required to create the appropriate features. He emphasizes that feature engineering and learning are largely inter-dependent steps and a significant degree of iteration is required to get the right results: creating the features, passing them through the learner, analyzing the results, tweaking the features, and going again. Domingos notes the importance of feature selection, indicating that a typical approach is to include only the features that have the best information gain for the class. However, he also cautions that features that look unimportant in isolation may in fact be very important in combination with other features. Kelleher et al. [4] echo this approach: rank and prune with filters. They highlight some alternatives, but conclude that the rank-and-prune approach is faster and typically delivers models with good accuracy.
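As an illustration of the rank-and-prune filter approach (not code from the original study), the following Python sketch scores features against the class with mutual information, an information-gain style criterion, and keeps only the top-ranked subset; the synthetic dataset and the cut-off of ten features are assumptions for demonstration.

```python
# Illustrative rank-and-prune feature filter: score each feature against the class
# with mutual information (an information-gain style criterion) and keep the top k.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in for an engineered feature matrix and a subscription label.
X, y = make_classification(n_samples=1000, n_features=50, n_informative=8,
                           random_state=42)

scores = mutual_info_classif(X, y, random_state=42)   # rank features by relevance
top_k = 10                                            # illustrative cut-off
keep = np.argsort(scores)[::-1][:top_k]               # indices of the best-ranked features
X_pruned = X[:, keep]
print("kept feature indices:", sorted(keep.tolist()))
```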

2.3 Marketing Science to Inform the Identification of the Right Features

Our industry partner uses a database marketing approach, using push notifications for their customer communications. McCarty and Hastak [5] advocate an RFM (Recency, Frequency, Monetary value) model when high-volume database marketing is used. It is based on the philosophy that the best customers to target with new offers are those who recently purchased from a marketer, those who purchase frequently from a marketer, and those who spend money with a marketer. McCarty and Hastak note that this model cannot be applied to new customers as there is no pre-existing transaction history. However, in our case, there is usage history. Hence, recency and frequency are considered key attributes to engineer into our model.

2.4 Handling of Imbalanced Datasets

Our dataset has an imbalance of 0.85% Real users to 99.15% Sandbox-only users, resulting in a need to manage this class imbalance. This extent of class imbalance falls into Weiss’s relative rarity problem [6]. Weiss noted the challenges caused by rarity for data mining and set out a range of scenarios and possible solutions. He strongly discourages simple oversampling but also does not strongly support under-sampling. With under-sampling, potentially valuable data is lost from the model, whereas oversampling both increases the time to build the classifier and can cause over-fitting. Chawla et al. [7] indicate that how much to over- or under-sample is usually established empirically. Burez and Van den Poel [8] note positive results from under-sampling, whilst recognizing that SMOTE for over-sampling might give better results. Given the mixed views in the literature, an under-sampling approach was adopted.

2.5 Machine Learning Algorithms

Several works informed the choice of appropriate machine learning algorithms. Support Vector Machines (SVMs) were considered a good choice as they can capture complex, non-linear relationships between data points. Cui and Curry [9] used SVMs to predict customer intention in a marketing study and the algorithm outperformed all others. SVMs are used in many customer churn prediction studies, which is analogous to the research question in this paper, i.e., prediction of future behavior based on analysis of a customer’s past actions. In their study of churn in the telecommunications field, Huang et al. [10] found that SVMs performed well on engineered features. This concurred with a similar study by Zhao et al. [11]. Xia and Jin [12] compare SVMs to several models and find they have the best accuracy and recall.

Decision trees are also considered as they have been used in many customer prediction studies. Several studies have found that decision trees can deliver accurate churn prediction models using customer data (Hung et al. [13]; Bin et al. [14]). Decision trees have also been used to identify high-value customers with good results (Han et al. [15]). Gradient Boosting Machines (GBM) build an ensemble prediction model by combining weak prediction models; the idea is to take the wisdom of the crowd and average (or vote) across different models. It is often applied to decision trees. Van Wezel and Potharst [16] show that GBM provides a large improvement over decision trees in customer choice prediction.

To better understand the dataset, clustering was applied to the majority class. Several studies have shown that under-sampling based on clustering performs better than other under-sampling methods (Yen and Lee [17]; López et al. [18]).

2.6 Model Evaluation

Lantz [19] advises that a good classification model finds a balance between predicting in a very conservative way and being overly aggressive. He recommends a balanced set of measures including sensitivity (also known as recall), specificity and precision. These measures do not focus on aggregate accuracy of the model but on accuracy for each of the classes. The F-measure combines precision and recall into a single performance measure. Rosset et al. [20] show that these measures can be used to compare results across different algorithms. These measures are used to assess the performance of the algorithms in this research project.
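To make these measures concrete, the short sketch below derives sensitivity, specificity, precision and the F-measure from a two-class confusion matrix; the counts are invented purely to illustrate the formulas.

```python
# Deriving the balanced evaluation measures from a 2x2 confusion matrix.
# The counts are invented; "transitioned" is treated as the positive class.
tp, fn, fp, tn = 80, 20, 50, 850

sensitivity = tp / (tp + fn)   # recall: transitioned users correctly identified
specificity = tn / (tn + fp)   # non-transitioned users correctly identified
precision = tp / (tp + fp)
f_measure = 2 * precision * sensitivity / (precision + sensitivity)

print(f"sensitivity={sensitivity:.3f}  specificity={specificity:.3f}  "
      f"precision={precision:.3f}  F-measure={f_measure:.3f}")
```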

3 Methodology

The paper is based on the Cross-Industry Standard Process for Data Mining (CRISP-DM) [21], as this methodology places a particular focus on business engagement and understanding at the outset, with the goal of deploying models within business operations.

3.1 Business Understanding

Key to this research is the business context of the data, and correspondingly the objectives of the study: namely, identifying the indicators that Sandbox-only customers will transition into premium customers who use the Real app. This corresponds to a supervised machine learning context, but one where the overarching objective is not the prediction per se, but the ability to identify robust indicators of transitioning customers such that core business and marketing strategies (and correspondingly departments) can be informed.

3.2 Data Understanding

Our partner extracts data from mixpanel.com, the primary system holding user-level and app usage data for each user. Upon receipt, the data was reviewed and a preliminary data dictionary was agreed to ensure domain correspondence and context. The data was subsequently transformed and loaded into a Microsoft SQL Server database. Here, missing-value issues and inconsistencies between the user-level and event-level data were identified. Given the absence of demographic data and the data quality issues, we concluded that the user-level data does not provide reliable predictors. The event-level data was identified as having the greatest potential, with initial analysis highlighting some interesting facts about user behavior. Users are more likely to subscribe in the early days after they have downloaded the app, with most users subscribing to Real within 40 calendar days of installation and, naturally, after a slightly lower number of active days on the app (see Fig. 1).

Fig. 1. Days to subscribe from initial app installation.

3.3 Data Preparation

Data Pre-processing.

The event data is composed of events generated when the end-user takes an action in the Real app. There are 208 different event types captured, including events as trivial as App Opened and App Closed and other events such as Saw Stock Details. The app usage event data contained over 10 million rows. A domain expert at our industry partner reviewed the list of possible events and identified those considered important for strategic decision making. Correspondingly, other events can be ignored, thus reducing the breadth and, by extension, the complexity of the data.

To maximize the relevance and focus of the study, users with fewer than five events were filtered out, as were users who had obviously churned, i.e., users who had not used the app in the last 90 days and had fewer than 10 events.

Feature Engineering.

The individual events in isolation are meaningless for prediction purposes. Furthermore, the event dataset is temporal in nature and, as the event logs change over time, the events are difficult to encode and pass to a classification model. However, features can be engineered from the app usage event data and provided as input variables to the classification algorithms. To achieve this, timestamped events were transformed into a matrix table with a row for each customer and columns representing engineered features of two distinct types:

  1. Aggregate user-level engagement features such as total number of events, number of elapsed calendar days, number of active days, half-life for total number of events, average event count per active day, average event count per calendar day, and half-life of the user’s active days on the app.

  2. Event-level features such as half-life of the event, average number of days between event occurrences, days since last and penultimate occurrence of the event, and half-life of the user’s activity days.

The feature engineering was performed in MS SQL Server, resulting in over 1,000 engineered features. More detail is provided in Sect. 4, and an illustrative sketch of the transformation follows.
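The sketch below illustrates, in Python, the general shape of this transformation from timestamped event rows to one row per user; the column names and toy events are assumptions, and the actual pipeline was implemented as SQL in MS SQL Server.

```python
# Illustrative transformation of timestamped event rows into a per-user feature matrix.
# Column names (user_id, event, ts) and the toy events are assumed; the real pipeline
# was implemented in SQL Server.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event":   ["App Opened", "Saw Stock Details", "Favorited a stock",
                "App Opened", "Saw Stock Details"],
    "ts": pd.to_datetime(["2017-01-01", "2017-01-03", "2017-01-10",
                          "2017-02-01", "2017-02-02"]),
})

grp = events.groupby("user_id")
user_features = pd.DataFrame({
    "total_events":  grp.size(),
    "active_days":   grp["ts"].apply(lambda s: s.dt.date.nunique()),
    "calendar_days": grp["ts"].apply(lambda s: (s.max() - s.min()).days + 1),
})

# One column per event type (the "matrix table" with a row per customer).
event_counts = events.pivot_table(index="user_id", columns="event",
                                  aggfunc="size", fill_value=0)
user_features = user_features.join(event_counts)
print(user_features)
```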

Clustering to Understand the Dataset.

The dataset under consideration includes users with varying degrees of app usage and tenure. To improve understanding of the dataset and the engineered features, clustering was run on the dataset using a small number of user-level attributes: total number of events, app calendar days, app activity days, and half-life for total number of events. These attributes measure app user engagement from a few different perspectives. Four user clusters were identified.

Dimensionality Reduction.

Working with over 1,000 dimensions is computationally expensive. Further steps were taken to prune the model, including removing features with zero variance and removing events which can be considered noise in the overall model, such as App Opened and App Closed. These steps reduced the total number of engineered features to 382. Dimensionality reduction was a manual process. There are too many features to exhaustively test for pairwise multi-collinearity in the dataset; the intention is to run the models, identify the subset of variables that are predictors, and perform pairwise multi-collinearity tests on this subset only.
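A minimal sketch of the zero-variance pruning step is shown below; the toy feature table and column names are invented for illustration.

```python
# Dropping zero-variance engineered features (one of the pruning steps described above).
import pandas as pd

# Toy feature table: "app_closed_total" never varies, so it is dropped.
feats = pd.DataFrame({
    "saw_stock_details_total": [3, 7, 1],
    "app_closed_total":        [0, 0, 0],
    "favorited_total":         [2, 0, 5],
})
feats_pruned = feats.loc[:, feats.nunique() > 1]
print(list(feats_pruned.columns))  # ['saw_stock_details_total', 'favorited_total']
```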

Balancing the Data.

As previously mentioned, the dataset is imbalanced: 0.85% Real users vs. 99.15% Sandbox-only users. Our objective is to predict membership of the Real user class. Whilst all Sandbox-only users are in scope of the analysis, it is important that there is balance across the dataset with good representation from both user types. Correspondingly, all Real customers are included in the sampling, and the Sandbox-only customers are under-sampled to achieve the required balance. Two sampling approaches were considered: randomized sampling and stratified sampling based on the clustering analysis. Whilst randomized sampling is considered adequate (Kelleher et al. [4]), stratified sampling was preferred as it guarantees that the relative frequencies of the identified clusters are maintained.

3.4 Modelling

This stage is where we build and run the supervised machine learning classification models. The stratified dataset was split 70% for Training/Test (using cross-validation) and 30% for Validation. The classifiers were run on the Train/Test dataset using the standard ten-fold cross-validation approach. Logistic Regression, C5.0 Decision Tree, and Support Vector Machine classifiers are run.
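A sketch of this modelling step is shown below using scikit-learn stand-ins (a logistic regression, a CART decision tree in place of C5.0, and an RBF SVM); the synthetic dataset and parameters are assumptions rather than the study's actual configuration.

```python
# Sketch of the modelling stage: 70/30 train-validation split, then ten-fold
# cross-validation on the training portion. scikit-learn models stand in for the
# GLM, C5.0 tree and RBF SVM used in the study.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=30, weights=[0.9, 0.1],
                           random_state=42)          # illustrative stand-in dataset
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.30,
                                            stratify=y, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree":       DecisionTreeClassifier(random_state=42),  # C5.0 analogue
    "svm_rbf":             SVC(kernel="rbf"),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X_tr, y_tr, cv=cv, scoring="roc_auc")
    print(f"{name}: mean ROC AUC over 10 folds = {scores.mean():.3f}")
```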

3.5 Evaluation

Each model produced is run on the Validation dataset. As well as overall model accuracy, sensitivity is measured, which gives the proportion of positive examples (transitioned customers) that were correctly classified. Likewise, specificity is measured, which gives the proportion of negative examples (non-transitioned customers) that were correctly classified. The models are run under a number of conditions, as presented in the Evaluation section. Here, engagement with our industry partner with respect to the findings also commences.

4 Implementation

This section describes the general ETL strategy, feature engineering, clustering analysis, data balancing and application of k-fold cross validation, as shown in Fig. 2.

Fig. 2. General architecture.

4.1 Extracting, Transforming and Loading App Data

Two CSV data files are extracted from mixpanel.com for this paper: a user profile data file and an app usage event data file. The user profile file contains one record per user, with basic user data such as city, app version, and the user’s iOS device. Emphasis is on the iOS app as the Android version has only recently launched and there is insufficient usage data available. The app usage event data contains a record for every event tracked by Mixpanel. The events can be either subscriber-initiated or platform-initiated (e.g., push notifications).

4.2 Feature Engineering

Via SQL transformation, the features below were engineered and stored in a features table, transforming raw data into data that can be interpreted and consumed by the classification algorithms, as depicted for example in Fig. 3. To provide a richer profile of users, several user-level features were engineered:

  • Total event count: the number of times any event was triggered for each user.

  • Total active days in app: the number of days on which the user was active in the app.

  • Total calendar days: the number of elapsed days from when the user was first seen to when the user was last seen.

  • Total event half-life: the number of calendar days it takes a user to generate half of their total recorded events. For example, if a user initiates 1000 events, the half-life is the number of days between the date the first event was initiated and the date on which the 500th occurrence was observed.

  • Average event count per active day: the total event count divided by the total active days, giving a measure of engagement per interaction day.

  • Average event count per calendar day: the total event count divided by the total calendar days, giving a sense of the sparsity of interaction. Perhaps a user engages heavily when on the app but rarely actually opens the app.

  • Half-life of user’s activity days: the number of calendar days taken to reach half of their total active days on the app. For example, if a user has 10 active days, the half-life is the number of days between the first active day and the day on which their fifth interaction day occurs.
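As an illustration of the half-life definition above (not the production SQL), the sketch below computes the total event half-life for a single user's timestamps; the dates are invented.

```python
# Total event half-life for one user: calendar days from the first event to the
# event at the halfway point of their log. Timestamps are invented for illustration.
import pandas as pd

ts = pd.to_datetime(pd.Series([
    "2017-01-01", "2017-01-02", "2017-01-02", "2017-01-05",
    "2017-01-20", "2017-02-01", "2017-02-10", "2017-03-01",
])).sort_values().reset_index(drop=True)

half_idx = len(ts) // 2 - 1                       # the 4th of 8 events
half_life_days = (ts.iloc[half_idx] - ts.iloc[0]).days
print(half_life_days)                             # 4 calendar days in this toy example
```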

Fig. 3. Sample event transformation to engineer user record features.

For each event type within a user’s dataset, event-level features were generated. For example, if a user has 1000 events in the dataset spanning 12 unique event types, features are engineered for each of the 12 unique events:

  • Average days between occurrences of an event: for each event, the number of days between the previous time the event was triggered and the current observation is computed. The sum of these time gaps is divided by the number of times the event has occurred, providing a measure of event periodicity. The lower this value, the more the event features within a user’s sessions with the app.

  • Days since last and penultimate occurrence of the event: the difference in days between the final occurrence and the penultimate occurrence of an event. This essentially measures a recency effect for each event; the recency value helps the model discover which events have featured most recently in a user’s activity.

  • Half-life of event: the number of calendar days it takes for a user to generate half of the total occurrences of a specific event in their logs.
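A small Python sketch of two of these event-level features for a single user's log follows; the event names, dates and column names are illustrative assumptions.

```python
# Per-event-type periodicity and recency features for one user's log.
# Event names, dates and column names are illustrative only.
import pandas as pd

log = pd.DataFrame({
    "event": ["Saw Stock Details"] * 4 + ["Stock Details: tap favorite"] * 2,
    "ts": pd.to_datetime(["2017-01-01", "2017-01-04", "2017-01-10", "2017-01-20",
                          "2017-01-05", "2017-01-18"]),
})

feats = {}
for name, s in log.groupby("event")["ts"]:
    gaps = s.sort_values().diff().dt.days.dropna()
    feats[name] = {
        "avg_days_between_occurrences": gaps.mean(),   # event periodicity
        "days_penultimate_to_final":    gaps.iloc[-1], # recency of the event
    }
print(pd.DataFrame(feats).T)
```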

A key part of the feature engineering process is removing a customer’s event data after they have successfully monetised to the premium service. A primary objective is to capture the set of events or features within a user’s app usage that are predictive of them converting to the premium service. Therefore, the event data of a monetised customer becomes redundant after they sign up and delivers only noise to the model. Within the event logs, there are two events that fire when a customer successfully signs up to the premium Real service. The time at which one of these events is captured for a user defines the monetisation point, and any event beyond this moment is pruned from the feature engineering phase.
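The sketch below illustrates this pruning step in Python; the conversion marker event name and the toy log are placeholders, since the two actual sign-up events are not reproduced here, and the production logic ran in SQL.

```python
# Pruning a converted user's events after their monetisation point, so only
# pre-conversion behaviour feeds the features. "Sign Up: Success" is a placeholder
# for the actual conversion-marking events; the toy log is invented.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 1, 1],
    "event": ["Saw Stock Details", "Sign Up: Success", "Saw Stock Details", "App Opened"],
    "ts": pd.to_datetime(["2017-01-02", "2017-01-05", "2017-01-08", "2017-01-09"]),
})

conversion_events = {"Sign Up: Success"}
first_conv = (events[events["event"].isin(conversion_events)]
              .groupby("user_id")["ts"].min()
              .rename("conv_ts").reset_index())

events = events.merge(first_conv, on="user_id", how="left")
pre_conversion = events[events["conv_ts"].isna() | (events["ts"] <= events["conv_ts"])]
print(pre_conversion[["user_id", "event", "ts"]])
```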

4.3 Clustering Analysis

Once the data cleaning and feature engineering were completed, k-means clustering was run to increase understanding of the data and to help stratify it for the classification algorithms. A subset of the user-level engineered variables was chosen as the input for the clustering: Total Event Count, Event Count Half-Life, App Calendar Days and App Activity Days. As these variables have different measurement scales, they were normalised to a common range. Clustering was run with k = 3, k = 4 and k = 5. Upon visual inspection of the clusters, and with the assistance of a scree plot, k = 4 provided the most meaningful clusters.
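The sketch below shows, with randomly generated stand-in data, how this step can be reproduced in Python: standardise the four attributes, run k-means for k = 3, 4 and 5, and compare within-cluster sums of squares as in a scree plot.

```python
# k-means over the four (normalised) engagement attributes, comparing k = 3, 4, 5.
# The data is randomly generated purely to stand in for the real attributes.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
engagement = rng.gamma(shape=2.0, scale=3.0, size=(500, 4))  # 4 engagement attributes

X = StandardScaler().fit_transform(engagement)               # common scale
for k in (3, 4, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"k={k}: within-cluster sum of squares = {km.inertia_:.1f}")

labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)  # chosen k
```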

Cluster 1, the Old Fogeys, represents light users who have used the app over the longest period (8%). Cluster 2, the Tyre Kickers, represents users who have been somewhat engaged with the app over a longer period of time (1%). Cluster 3, the Recent Casuals, represents the most recent users who are less engaged than cluster 4 but more engaged than clusters 1 and 2 (59%). Finally, cluster 4, the Enthusiastic Beginners, represents more recent users who are well engaged with the app compared to the other clusters (32%). The standardised z-scores are included in Table 1.

Table 1. Standardized z-scores for k = 4 k-means clusters
Table 2. All classification models with 382 engineered features & stratified sampling with a 50/50 mix

Here, calendar days is a measure of tenure; active days, a measure of user activity; total event half-life measures initial engagement (how long the user required to reach half their total number of events); and total number of events measures the degree of use. In each of these four cases, a positive number indicates above average, 0 is average, and a negative number below average. Count and frequency indicate the proportion of the data captured within each cluster.

4.4 Mitigating Unbalanced Data

Running the classification algorithms without under-sampling the dominant class would lead to distorted results. Thus, we include the full positive class in the new dataset. We then clustered the dominant class to understand its component clusters, which identified the four clusters described above. Finally, we took stratified samples of the dominant class using the four clusters to form the Sandbox-only part of the dataset, as illustrated in Fig. 4. The models are built and run for both 90:10 and 50:50 datasets, split between the two customer types.
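A sketch of this cluster-stratified under-sampling is given below for the 50/50 mix; the column names, cluster labels and user counts are invented for illustration.

```python
# Cluster-stratified under-sampling: keep every Real user and sample Sandbox-only
# users in proportion to their k-means cluster. Counts and column names are invented.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
users = pd.DataFrame({
    "is_real": np.r_[np.ones(50, dtype=int), np.zeros(5000, dtype=int)],
    "cluster": np.r_[np.full(50, -1), rng.integers(1, 5, size=5000)],  # -1 marks Real users
})

real = users[users["is_real"] == 1]
sandbox = users[users["is_real"] == 0]
n_needed = len(real)                  # 50/50 mix; use 9 * len(real) for a 90/10 mix

# Sample each cluster in proportion to its share of the Sandbox-only population.
frac = n_needed / len(sandbox)
sampled = (sandbox.groupby("cluster", group_keys=False)
                  .apply(lambda g: g.sample(frac=frac, random_state=42)))
balanced = pd.concat([real, sampled])
print(balanced["is_real"].value_counts())
```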

Fig. 4. Creating a stratified, balanced dataset of five categorical user representations: four types of non-premium user identified via k-means (Sandbox only) and one premium user type (Real).

4.5 Evaluating Machine Learning Models

To evaluate the machine learning models, four steps are taken, as shown in Fig. 5: (1) The stratified dataset is split into Train/Test and Validation sets, with a 70:30 split. (2) 10-fold cross-validation is run on the Train/Test dataset. (3) The 30% validation dataset is passed into the models to test accuracy; this step aids validation as the data has been unseen by the model. (4) Key performance measures such as ROC, sensitivity and specificity are produced and examined. This process is repeated 10 times, drawing samples from the overall dataset, and average measures are reported.

Fig. 5. Evaluation strategy.

5 Model Evaluation and Results

A number of classification scenarios are run to evaluate the research objective using the same feature-engineered dataset. This section describes these models and their results in further detail. Five classification scenarios are selected to address the challenge with different parameters:

  1. Baseline model with random sampling of the dominant Sandbox-only class (90/10 mix)

  2. All classification models with 382 engineered features & stratified sampling with a 50/50 mix

  3. All classification models with 382 engineered features & stratified sampling with a 90/10 mix

  4. All classification models with “important input variables” (108 in total) & stratified sampling with a 90/10 mix

  5. All classification models run on the top 7 most important engineered features identified by the other models to create an ensemble result

For scenarios 2, 3 and 4, the classification models used were Logistic Regression (GLM), Gradient Boosting Machine (GBM), C5.0 Decision Tree and a Support Vector Machine (SVM) using the Radial Basis Function kernel. For scenarios 2 and 3, 382 engineered features are used, a subset of the total engineered features obtained by removing features with zero variance. For scenario 4, a subset of the engineered features is used based on the expertise of the business team. All models are evaluated through the ROC curve, sensitivity and specificity measures.

5.1 Baseline Models

A number of baseline models were built to establish the level of difficulty of the prediction problem. These models permit gauging the complexity of the machine learning task, which is valuable given the significant class imbalance. Essentially, they mirror reasonable, informed business categorizations by a human; an illustrative sketch of these rules follows the list below.

  • Top Activity: predict “Yes”, the user is a Real subscriber, if the user’s event count is in the top 5% of event counts (i.e., the user has a large number of events in the app), and “No” if the user’s activity is below the top 5%.

  • Calendar Days: predict “Yes”, the user is a Real subscriber, if the user has had the app for fewer than 28 calendar days, and “No” otherwise. In this case, we treat all new users as potential subscribers.

  • Active Calendar: predict “Yes”, the user is a Real subscriber, if the user has had the app for fewer than 28 calendar days and their activity is in the top 5% of event counts (i.e., the previous two rules combined), and “No” otherwise.
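The following Python sketch encodes these three baseline rules; the feature table, column names and distributions are invented stand-ins.

```python
# The three heuristic baseline rules, applied to an invented user feature table.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
user_features = pd.DataFrame({
    "total_events":  rng.poisson(40, size=1000),
    "calendar_days": rng.integers(1, 365, size=1000),
})

top5_cutoff = user_features["total_events"].quantile(0.95)   # top 5% of event counts

pred_top_activity = user_features["total_events"] >= top5_cutoff
pred_calendar_days = user_features["calendar_days"] < 28
pred_active_calendar = pred_top_activity & pred_calendar_days  # both rules combined

print("predicted subscriber rates:",
      round(pred_top_activity.mean(), 3),
      round(pred_calendar_days.mean(), 3),
      round(pred_active_calendar.mean(), 3))
```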

Top Activity and Active Calendar provide the highest accuracy (95.0% and 98.5% respectively), with Calendar Days achieving 21.9%. For sensitivity, Top Activity (0.579), Calendar Days (0.357) and Active Calendar (0.100) indicate that, for Top Activity, approximately 58% of users who will transition are correctly classified as such. For specificity, Top Activity achieves 0.953, Calendar Days 0.218 and Active Calendar 0.991, indicating that Active Calendar will correctly identify 99.1% of users that will not transition. For such simple models, based mainly on human intuition, this is encouraging performance. However, note that the more critical performance measure is sensitivity, as it emphasizes a model’s ability to correctly classify transitioning users.

5.2 Machine Learning Models

Tables 3, 4, 5 and 6 present the performance results for each of the machine learning scenarios described above, consisting of a 10-fold cross-validation evaluation reporting the average performance measure achieved across the 10 folds (Table 2).

Table 3. Aggregate performance across scenarios 2–5
Table 4. Classification model with the most engineered features (382) & stratified sampling with a 90/10 mix
Table 5. Classification models with “important input variables” (108) & stratified sampling with a 90/10 mix
Table 6. Models trained using only the top 7 most important variables identified by the other models

The most important variables (i.e., the pre-engineered variables) were: (1) Stock: Did Tap On Favorite; (2) Showroom stock: tap stock; (3) Stock suggested; (4) Sign Up: Skip; (5) Link Broker: Tap; (6) Link DriveWealth Account; (7) Saw Showroom: Couch Mark Sign Up: Success; (8) Stock Details: Graph Scrolled; (9) Sent to broker; (10) Stock Details: tap favorite.

Most important variables here in order of importance were: (1) Stock suggested; (2) Showroom stock: tap stock; (3) Sent to broker; (4) Sign Up: Success; (5) Showroom: tap BBN; (6) Order Popup: Tap fund; (7) Stock Details: Scroll; (8) Stock Details: tap favorite; (9) Stock Details: tap unfavourite; (10) DW: Order created

Most important variables here in order of importance were: (1) Stock Details: Graph Scrolled; (2) Stock Details: tap favorite; (3) Saw Store; (4) Broker: Order created; (5) Saw More; (6) Sent to broker; (7) Stock Details: tap invest now; (8) Saw Stock Details; (9) Store: did tap on purchase; (10) Order created.

Across each of these sampling methods with varying quantities of features, 7 appear consistently, namely: Stock Details: tap unfavorite_Total; Stock Details: Scroll_Total; Stock Details: tap favorite_Total; Stock suggested_Total; Stock Details: Graph Scrolled_Calendar; Stock: Did Tap On Favorite_Calendar; Stock Details: Graph Scrolled_Total. Thus, we include one modelling exercise using only these features (Table 6).

The most important engineered features as reported by the models are presented here; these are the top features across scenarios 2, 3 and 4 above. Sensitivity fell for this model, though specificity stayed high. Thus, we can identify that, using stratified sampling and a rich dataset (382 features) comprising mostly engineered features, the C5.0 tree does well at identifying transitioning users.

To provide more depth on classification performance, specifically with respect to specificity and sensitivity, Fig. 6 depicts the ROC curves for each of the four machine learning-based classification scenarios.

Fig. 6. ROC curves for each of the 4 machine learning scenarios.

5.3 Summary

An aggregate of scenarios 2–5 is developed, calculating an average of the measures across the four scenarios: “Important Features at 50–50 Sample”, “Full Feature Set”, “Important Features at 90–10 Sample” and the “7 Variables”. Taking the average across all approaches gives an overall measure of how robust each algorithm is when faced with different approaches to feature selection.

C5.0 and GBM perform similarly in general, with C5.0 having a slightly higher F1 score. Kelleher et al. [4] suggest that F1 is a good measure for prediction problems, as it places an emphasis on capturing performance on the positive level (the important level). Hence, we recommend the C5.0 model ahead of the GBM, not just based on the F-score, but also because it demonstrated exceptional sensitivity performance (correctly identifying a transitioning user). For this business situation, we also need to evaluate the cost of getting the prediction wrong. All customer engagement is through push notifications. A false negative, i.e., predicting that a user will not transition when the user could be a potential premium user, is a lost business opportunity. A false positive (incorrectly predicting a transitioning user) is not such a serious issue for this business. However, correctly identifying transitioning users, or those that start to display the key features indicative of transitioning users, is highly relevant and has been effectively demonstrated here.

6 Conclusion and Future Work

In this paper, we have demonstrated that machine learning approaches can both identify the key user behavioral indicators of freemium users transitioning into premium users (i.e., paying subscribers) of trading service apps and accurately predict these transitions. The data used in this paper corresponds to 10 million user events over a period of months for a production trading and investment mobile app. To facilitate our approach, we employed the CRISP-DM data mining methodology, significant feature engineering, and subsequent dimensionality reduction techniques. We benchmarked our models against informed business user practices to gauge the complexity of the problem and to demonstrate the potential of the machine learning techniques employed (C5.0, GBM, Logistic Regression, and a Support Vector Machine).

Our industry partner is keen to further develop the capabilities reported in this paper. They plan to hire a data scientist whose full-time job will be to develop and run machine learning models, building on these initial findings. The insights gained will be used in a number of different ways:

  • to support targeted marketing to users who are at a certain stage in the lifecycle

  • as an early warning system when expected behaviors are not occurring (e.g., maybe a new app version has bugs)

  • to support design enhancements to the product to enable easier display and use of the key predictive user events.

There are a couple of methodological limitations in our approach which can be addressed as future work:

  • Test for pairwise multi-collinearity, which was deferred due to the volume of features and the relative computational complexity of the task

  • Adopt alternate under-sampling and over-sampling mechanisms such as SMOTE and test for improvements

However, the key limitation of this work, motivating future work, is that the classification models did not distinguish whether the user was going to make a one-off premium purchase (e.g., purchase a one-month subscription) or become a recurring subscriber (e.g., an annual subscription). Thus, there are significant further opportunities to extend our approach to predict the number of one-off purchases or annual subscriptions through regression models.