1 Introduction

In recent years, electronic commerce (hereinafter “EC”) has continued to evolve at a rapid pace [1]. As the EC market expands and grows, competition for customers is expected to become fierce. Choosing appropriate target customers is therefore very important for expanding sales and improving profitability.

Therefore, EC companies need to identify, as early as possible, new customers who have the potential to become loyal customers. Here, the first purchase date can be regarded as a key point, and we look for behaviors that these customers have in common at their initial purchases. When customer satisfaction rises, companies improve their sales and profits. A relationship in which both sides benefit from each other is desirable.

Figure 1 shows the framework of the customer hierarchy. First, customers visit the website. Upper-level customers purchase frequently and spend large amounts. Therefore, finding these loyal customers and developing new loyal customers are very important strategies for a retail company.

Fig. 1. Framework of customer hierarchy

In this study, we focus on new customers. Our purpose is to clarify the characteristic behaviors of highly loyal customers using customer membership information data, purchase data and access history data.

2 Datasets

We target a general electronic commerce website (hereinafter “EC site”) related to golf. The EC site provides services such as sales of golf equipment, reservations for golf courses and golf score management. From these services, we used the following data.

  • Customer information data (age, sex, registration date, etc.)

  • Purchase history data (category of purchase items, purchase date, whether purchased item is brand-new or secondhand, etc.)

  • Access history data (log in date and time, URL of access page, URL of referrer page, etc.)

The category names of the products included in the purchase data are shown in Table 1.

Table 1. Category name of item

Target Customer

In this study, we analyzed 5,553 customers who made their first purchase between May 1, 2015 and July 30, 2015 and who purchased more than twice within one year of the initial purchase date. We excluded customers for whom more than one year had passed since registration.

In Fig. 2, we show the target period used in this research.

Fig. 2. Target period

Explanatory Variables

We examined the factors that influence the first purchase using the above data. Based on the results, we created explanatory variables from customers’ membership information (5 variables), purchasing behavior at the time of the initial purchase (11 variables) and web browsing behavior on the initial purchase date (13 variables) [4].

Details of the explanatory variables are shown in Tables 2, 3 and 4.

Table 2. Demographic variables used in the model construction.
Table 3. Purchasing behavior used in the model construction.
Table 4. Access history variables used in the model construction.

Table 2 presents the demographic variables created from the membership information data.

Table 3 presents the purchasing behavior variables created from the purchase data.

Table 4 presents the access history variables created from the web browsing data.

3 Analysis of Loyal Customer

In this study, we analyze behavior on the initial order date for customers who purchase more than once a year, using customer membership information data, purchase record data and web access log data from a golf EC site.

First, we evaluated the loyalty of new customers by RFM analysis. We determined customers’ loyalty from three purchasing behavior indicators (Recency, Frequency, Monetary) and, based on this, categorized them as loyal customers or general customers.

Next, we created variables related to the initial purchase and exploratory behavior and constructed a discrimination model of customer loyalty by logistic regression analysis. Through these analyses, we sought to understand the characteristics of highly loyal customers on the initial order date.

3.1 RFM Analysis

RFM analysis is one of the most common approaches in database marketing and a proven marketing model for behavior-based customer segmentation. It groups customers on recency, frequency, and monetary value.

We evaluated the loyalty of customers using RFM analysis to divide customers into loyal and general ones [2]. Commonly, the F in RFM analysis is determined by the number of purchases. Here, we defined F by the total number of logins instead of the number of purchases, because frequent browsing behavior also relates to a customer’s loyalty to the website.

RFM stands for the three dimensions:

  • Recency: Period since last purchase

  • Frequency: Total number of logins within the period

  • Monetary: Amount of purchase within the period

RFM scoring assigns a score from 1 to 5 for each dimension. The maximum score represents the preferred behavior.

Customers are divided into five equal groups for each of recency, frequency and monetary. The maximum score for each of the three dimensions means the following:

  • Recency: The maximum score (5) represents the smallest number of days that have passed since the customer last purchased within a year.

  • Frequency: The maximum score (5) represents the largest number of logins within a year.

  • Monetary: The maximum score (5) represents the highest value of all purchases within a year.
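The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used in this study: the function name `quintile_scores`, the parameter `higher_is_better`, the sample values and the tie-breaking rule (ties keep their input order) are all assumptions introduced here.

```python
def quintile_scores(values, higher_is_better=True):
    """Score each customer from 1 to 5 by splitting the ranking into five equal groups.

    Hypothetical helper: ties are broken by input order, which the
    original text does not specify.
    """
    n = len(values)
    # Order customer indices from least preferred to most preferred behavior.
    order = sorted(range(n), key=lambda i: values[i], reverse=not higher_is_better)
    scores = [0] * n
    for rank, idx in enumerate(order):
        # Ranks 0 .. n-1 are mapped onto the five groups 1 .. 5.
        scores[idx] = rank * 5 // n + 1
    return scores


# Illustrative values only (not data from the study).
days_since_last_purchase = [3, 40, 200, 15, 90, 300, 7, 120, 60, 25]
login_counts = [50, 4, 1, 22, 9, 2, 31, 6, 12, 18]

r_scores = quintile_scores(days_since_last_purchase, higher_is_better=False)
f_scores = quintile_scores(login_counts, higher_is_better=True)
```

Recency is scored with `higher_is_better=False` because fewer elapsed days is the preferred behavior, whereas frequency and monetary are scored with larger values ranked higher.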

3.2 Binomial Logistic Regression

The purpose of this study is to predict highly loyal customers from their initial purchase and browsing behaviors. When the objective variable to be predicted is binary, binomial logistic regression models are often used.

The binomial logistic regression model is a type of classifier. By interpreting the significant explanatory variables in the constructed model, it is possible to clarify the characteristics that affect the presence or absence of repurchase. In binomial logistic regression analysis, the customer’s repurchase probability \( p_{i} \) is expressed by the following equation [3].

$$ p_{i} = \frac{{\exp \left\{ {\sum\nolimits_{j = 0}^{m} {\beta_{j} X_{ij} } } \right\}}}{{1 + \exp \left\{ {\sum\nolimits_{j = 0}^{m} {\beta_{j} X_{ij} } } \right\}}} $$
(1)
  • \( X_{ij} : \) Factors affecting repurchase (\( X_{i0} = 1) \)

  • \( \beta_{j} : \) Parameters for each explanatory variable (\( \beta_{0} \) is intercept)
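As a small worked illustration of Eq. (1), the probability can be computed as follows. The function name `repurchase_probability` and the coefficient values in the usage note are hypothetical and are not estimates from this study.

```python
import math


def repurchase_probability(beta, x):
    """Evaluate Eq. (1) for one customer.

    beta[0] is the intercept (it multiplies the constant X_i0 = 1);
    x holds the remaining explanatory variables X_i1 .. X_im.
    """
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    # exp(z) / (1 + exp(z)), written so that large |z| does not overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

For example, with a hypothetical intercept of -1.0 and a single coefficient of 0.5, a customer with \( X_{i1} = 2 \) has a linear predictor of zero, so `repurchase_probability([-1.0, 0.5], [2.0])` is 0.5.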

We prepared demographic variables and variables related to initial purchase behavior and exploratory behavior (Tables 2, 3 and 4), and constructed a discrimination model of customer loyalty by binomial logistic regression analysis. Here, we label loyal customers as 1 and general customers as 0.

In logistic regression analysis, when there are too many explanatory variables, the regression equation may be difficult to interpret, or the generalization performance of the prediction may decrease. Multicollinearity may also occur when some variables are highly correlated. Therefore, in this study, to select the truly effective variables, we used a stepwise method based on Akaike’s Information Criterion (AIC).
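For reference, the standard form of AIC for a binomial logistic regression model can be written as below. The symbols \( k \) (number of estimated parameters, including the intercept), \( n \) (number of customers), \( y_{i} \) (label, 1 for loyal and 0 for general) and \( \hat{p}_{i} \) (fitted probability from Eq. (1)) are notation introduced here for illustration.

$$ {\text{AIC}} = - 2\ln \hat{L} + 2k,\quad \ln \hat{L} = \sum\nolimits_{i = 1}^{n} {\left\{ {y_{i} \ln \hat{p}_{i} + \left( {1 - y_{i} } \right)\ln \left( {1 - \hat{p}_{i} } \right)} \right\}} $$

A stepwise procedure of this kind typically adds or removes one variable at a time and keeps the change only when it lowers AIC, so a variable is retained only if the gain in log-likelihood outweighs the penalty of 2 per additional parameter.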

To confirm the discrimination accuracy of the model, we divided the data used in the logistic regression analysis into two groups (Group A and Group B) and performed 2-fold cross-validation.

Cross-validation is mainly used in settings where the purpose is prediction and one wants to estimate how accurately a predictive model will perform in practice.
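The 2-fold procedure can be sketched generically as follows. This is a minimal illustration under stated assumptions: the function `two_fold_cross_validation`, the `fit` and `predict` callables and the random split with a fixed seed are all hypothetical, and the sketch does not reproduce how Group A and Group B were actually formed in this study.

```python
import random


def two_fold_cross_validation(rows, labels, fit, predict, seed=0):
    """Split the data into two groups, train on one, test on the other, then swap.

    `fit(rows, labels)` returns a model and `predict(model, row)` returns a
    label; both are supplied by the caller (hypothetical interface).
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    half = len(indices) // 2
    group_a, group_b = indices[:half], indices[half:]

    accuracies = {}
    for name, train, test in (
        ("train A / test B", group_a, group_b),
        ("train B / test A", group_b, group_a),
    ):
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        predicted = [predict(model, rows[i]) for i in test]
        # Share of test customers whose predicted label equals the true label.
        correct = sum(p == labels[i] for p, i in zip(predicted, test))
        accuracies[name] = correct / len(test)
    return accuracies
```

Each customer is used exactly once for testing, and the two accuracies can then be compared to see which training group yields the better model.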

To confirm the prediction accuracy of the constructed model, we performed hold-out validation using the training data and the test data. Specifically, we created a confusion matrix as in Table 5 and calculated the prediction accuracy of the constructed model using the following equations.

Table 5. Confusion matrix

Accuracy (ACC): Percentage of correct predictions among all predictions.

$$ {\text{ACC}} = \frac{TP + TN}{FP + FN + TP + TN} $$
(2)

Precision (PRE): Percentage of actual positives among the cases predicted as the positive class.

$$ {\text{PRE}} = \frac{TP}{TP + FP} $$
(3)

Recall (REC): Percentage of cases predicted as the positive class among the actual positives.

$$ {\text{REC}} = \frac{TP}{FN + TP} $$
(4)

F-measure: Harmonic mean of PRE and REC.

$$ {\text{F-measure}} = 2 \times \frac{PRE \times REC}{PRE + REC} $$
(5)
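Equations (2) to (5) can be computed directly from the four cells of the confusion matrix, as in the sketch below. The function name `evaluation_indicators` and the counts in the usage line are illustrative assumptions, not results from this study, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def evaluation_indicators(tp, fp, fn, tn):
    """Return ACC, PRE, REC and F-measure (Eqs. 2-5) from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Convention assumed here: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    acc = ratio(tp + tn, fp + fn + tp + tn)
    pre = ratio(tp, tp + fp)
    rec = ratio(tp, fn + tp)
    f_measure = ratio(2 * pre * rec, pre + rec)
    return {"ACC": acc, "PRE": pre, "REC": rec, "F-measure": f_measure}


# Illustrative counts only.
indicators = evaluation_indicators(tp=40, fp=10, fn=20, tn=30)
```

With these illustrative counts, ACC is 70/100 = 0.7, PRE is 40/50 = 0.8 and REC is 40/60, so the F-measure is 80/110, or about 0.727.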

4 Results and Discussions

In this section, we present and discuss the results of our analysis.

4.1 RFM Analysis

Customers were divided into five equal groups for each of recency, frequency and monetary. The categories for each attribute of RFM are shown in Table 6.

Table 6. Categories for each attribute of RFM

Although the number of target customers in this research was 5,553, at the time of model construction we randomly sampled general customers so that their number equaled the number of loyal customers.
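This kind of class balancing (random undersampling of the larger class) can be sketched as follows. The function `balance_by_undersampling`, the `is_loyal` callable and the fixed seed are hypothetical names introduced for illustration; the sketch assumes there are at least as many general customers as loyal customers.

```python
import random


def balance_by_undersampling(customers, is_loyal, seed=0):
    """Keep every loyal customer and an equally sized random sample of general customers.

    Assumes general customers are at least as numerous as loyal ones;
    otherwise random.sample raises ValueError.
    """
    loyal = [c for c in customers if is_loyal(c)]
    general = [c for c in customers if not is_loyal(c)]
    # Sampling without replacement, so no general customer appears twice.
    sampled_general = random.Random(seed).sample(general, len(loyal))
    return loyal + sampled_general
```

Balancing the two classes in this way keeps the model from simply favoring the majority class, at the cost of discarding some general customers.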

The sizes of the datasets (Group A and Group B) used in model construction are shown in Table 7.

Table 7. Datasets used in prediction model

4.2 Binomial Logistic Regression

In each iteration, the model is fit to one group of the data and used to predict the other group.

We built two models that predict loyal customers using binomial logistic regression analysis with the AIC-based stepwise selection method.

The evaluation indicators for confirming the prediction accuracy are shown in Table 8.

Table 8. Evaluation indicator of model for customers (%)

Both models achieved high accuracy (Table 8). Since conventional studies on EC sites reported accuracies of about 60%, the prediction accuracy obtained in this study can be considered sufficient.

The accuracy is higher when Group A is used as training data. Table 9 shows the partial regression coefficients.

Table 9. Partial regression coefficients.

Eleven variables were selected from the 29 candidate variables.

Table 9 shows that many of the selected variables were created from purchase data. In addition, the confusion matrix for the test data of this model is shown in Table 10.

Table 10. Confusion matrix of model for customers

4.3 Discussions

We selected the explanatory variables whose coefficients had a significance probability (p-value) of less than 0.05. Eight explanatory variables were selected (Table 11).

Table 11. Estimated value of selected partial regression coefficient

Overall, since all the partial regression coefficients are positive, the higher the values of the selected variables, the more likely a customer is to become a loyal customer.

Among all the variables, the total number of items purchased on the initial order date has the largest partial regression coefficient. Loyalty may therefore be improved by raising customer satisfaction, for example by giving coupons or gifts to customers who purchase many items on the initial order date.

Since the partial regression coefficient of “Whether the member registration date matched the initial order date or not” is also positive, we considered that these are customers who had been interested for a long time and took a long time to purchase. From this result, it seems that recommending similar items promotes purchases.

It seems that recommending sale items in the men’s wear, golf club and accessory categories to customers who have registered as members but have not yet purchased leads to the promotion of purchasing.

It is considered necessary to improve customer loyalty by recommending goods for comparison, without limiting the price range, at the initial purchase.

4.4 Verification

We verified the prediction model built in this study with data from the same period two years later. The results are shown in Tables 12 and 13.

Table 12. Confusion matrix of model for customers
Table 13. Evaluation indicator of model for customers (%)

Here, although high prediction accuracy was obtained, the precision was low. This suggests that the model distinguishes loyal customers from general customers well overall, but that many of the customers it predicted to be loyal were not actually loyal.

5 Conclusion

In this study, we determined customers’ loyalty by RFM analysis and constructed a discrimination model of customer loyalty by logistic regression analysis to find the characteristic behavior of loyal customers on a golf EC site.

Through our analyses, we built a useful model to predict loyal customers using the web access logs and purchase record data at the initial purchase on a golf EC site. As a result, we clarified the initial purchase and browsing behavior of highly loyal customers and proposed marketing measures. Even for the data from two years later, the model achieved high accuracy.

However, in this study we made predictions from data at a single point in time. It is important to check the prediction accuracy for loyal customers by also analyzing data at the time of transition.