1 Introduction

Recently, the electronic commerce (EC) market has expanded drastically with the spread of the Internet. Online purchasing of facility reservation products has been growing year after year, but the risk of cancellation is also increasing. As EC sites become established as a purchase channel, reducing the opportunity loss caused by user cancellations is an important issue for reservation sites in the hospitality industry, such as those for hotels and golf courses. Against this background, many studies have aimed at reducing cancellations [1, 2]. However, these studies did not focus on the characteristics of the cancellation targets, and cancellation factors have not been sufficiently studied with a focus on user behavior at the time of cancellation.

2 Purpose of This Study

This study aims to identify cancellation factors based on the characteristics of golf courses on a reservation site.

This study uses data provided by a golf course reservation site: reservation data, user data, causal data, and review data. Reservation data contain reservation information such as play date and play fee. User data contain user attributes such as gender and generation. Causal data consist of golf course attributes such as price range, course type, and capacity. Review data contain reviews of golf courses, such as review scores. We use only data that satisfy the following conditions.

  • Reservation data

    • Play date between April 1, 2014 and March 31, 2015

    • Play start time between 7:00 and 11:00

  • User data

    • Age between 20 and 79

  • Causal data

    • Located in Kanto district

  • Review data

    • Golf courses that have play data within the period of the reservation data
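
The selection conditions above can be sketched as a single predicate. This is only an illustration: the function name, its arguments, and the region label are hypothetical, since the paper does not describe the site's schema, and treating both endpoints of each range as inclusive is an assumption. The review-data condition concerns whole courses rather than single reservations and is left out.

```python
from datetime import date, time

# Hypothetical field names; the real data layout is not given in the paper.
def keep_reservation(play_date, start_time, user_age, region):
    """Return True if one reservation meets the per-record selection conditions."""
    return (
        date(2014, 4, 1) <= play_date <= date(2015, 3, 31)  # play date window
        and time(7, 0) <= start_time <= time(11, 0)         # play start time
        and 20 <= user_age <= 79                            # user age
        and region == "Kanto"                               # course location
    )

# Example: a 45-year-old playing at 8:30 on June 1, 2014 in the Kanto district.
example_kept = keep_reservation(date(2014, 6, 1), time(8, 30), 45, "Kanto")
```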

3 Analysis of Cancellation Factors

3.1 Flow of Analysis

Figure 1 shows the outline of the analysis.

Fig. 1. Outline of analysis

First, we classify golf courses using causal data (such as price range, course type, and capacity) and review data (such as review score), on the assumption that user behavior at the time of cancellation differs according to the characteristics of golf courses. We then define the characteristics of each cluster. Second, to identify the cancellation factors of golf courses, we perform logistic regression analysis on each cluster using reservation data (such as play date and play price) and user data (such as gender and generation).

3.2 K-means Clustering

We classify golf courses using causal and review data, on the assumption that user behavior at the time of cancellation differs according to the characteristics of golf courses. We use k-means clustering, which minimizes the objective \( d \) defined in formula (1).

$$ d = \sum\nolimits_{x_{j} \in X} \min_{i \in \{ 1, \ldots ,k\} } \left\| x_{j} - c_{i} \right\|^{2} $$
(1)

Here \( x_{j} \) \( (j = 1, \ldots ,n) \) is the vector of variable values for case \( j \), \( X \) is the set of all cases, and \( n \) is the number of cases. \( c_{i} \) is the center of cluster \( i \left( {i = 1, \ldots ,k} \right) \) [3, 4]. We classify golf courses into four clusters for ease of interpretation.
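
As a minimal sketch of formula (1), the following computes the objective for a toy data set and applies one assignment-and-update iteration. The paper does not say which k-means variant or software routine was used, so the Lloyd-style step, the function names, and the toy points are all assumptions made for illustration.

```python
def squared_distance(x, c):
    """Squared Euclidean distance between a case x and a cluster center c."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def kmeans_objective(points, centers):
    """d in formula (1): each case contributes its squared distance to the nearest center."""
    return sum(min(squared_distance(x, c) for c in centers) for x in points)

def lloyd_step(points, centers):
    """Assign each case to its nearest center, then move each center to its group mean."""
    groups = [[] for _ in centers]
    for x in points:
        nearest = min(range(len(centers)),
                      key=lambda i: squared_distance(x, centers[i]))
        groups[nearest].append(x)
    new_centers = []
    for group, old in zip(groups, centers):
        if group:
            new_centers.append(tuple(sum(col) / len(group) for col in zip(*group)))
        else:
            new_centers.append(old)  # keep an empty cluster's center unchanged
    return new_centers

# Toy data (not from the paper): two obvious groups and k = 2.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]
before = kmeans_objective(points, centers)   # 8.0
centers = lloyd_step(points, centers)        # [(0.0, 1.0), (10.0, 1.0)]
after = kmeans_objective(points, centers)    # 4.0
```

In this toy case the update step halves the objective, which is the sense in which k-means searches for centers that make \( d \) small.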

3.3 Logistic Regression Analysis

To identify the cancellation factors for each characteristic of golf courses, we perform logistic regression analysis on each cluster using reservation and user data. With the regression coefficients denoted \( \beta_{k} \) and the explanatory variables denoted \( x_{k} \), the probability of cancellation \( p \) is given by formula (2) [5].

$$ p = \frac{{{ \exp }\{ \beta_{0} + \beta_{1} x_{1} + \cdots + \beta_{k} x_{k} \} }}{{1 + { \exp }\{ \beta_{0} + \beta_{1} x_{1} + \cdots + \beta_{k} x_{k} \} }} $$
(2)

Here \( p \) in (2) is the probability that the response variable \( y \), defined below, equals 1, and we estimate the parameters \( \beta_{k} \).

$$ y = \begin{cases} 1 & \text{cancelled} \\ 0 & \text{played} \end{cases} $$
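
To make formula (2) concrete, the sketch below evaluates the cancellation probability for given coefficients. The function name and the coefficient values in the example are hypothetical and are not estimates from the paper; the two branches are algebraically the same expression as (2), written so that large positive or negative inputs do not overflow.

```python
import math

def cancellation_probability(beta0, betas, xs):
    """Evaluate formula (2): p = exp(z) / (1 + exp(z)), z = beta0 + sum(beta_k * x_k)."""
    z = beta0 + sum(b * x for b, x in zip(betas, xs))
    if z >= 0:
        # Same value as exp(z) / (1 + exp(z)), but avoids overflow for large z.
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical coefficients: the linear predictor is 0, so p is 0.5.
p_example = cancellation_probability(0.0, [1.0, -1.0], [2.0, 2.0])
```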

4 Results

4.1 Classification of Golf Courses by the Characteristics

First, we classify golf courses by k-means clustering using causal and review data, on the assumption that user behavior at the time of cancellation differs according to the characteristics of golf courses. The variables used for k-means clustering are shown in Table 1.

Table 1. Variables used for k-means clustering

The variables in Table 1 are standardized before k-means clustering. We tried three to six clusters and, based on interpretability, adopted the four-cluster model. Summary statistics of the quantitative data are shown in Table 2.

Table 2. Summary statistics of the quantitative data used for k-means clustering
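
The standardization step can be sketched as a z-score transform. The paper does not state which standard deviation was used, so the sample standard deviation here is an assumption, and the capacity values are made up for illustration.

```python
from statistics import mean, stdev

def standardize(values):
    """Rescale a variable to zero mean and unit (sample) standard deviation."""
    m = mean(values)
    s = stdev(values)  # assumption: sample SD (n - 1 in the denominator)
    if s == 0:
        return [0.0 for _ in values]  # a constant variable carries no information
    return [(v - m) / s for v in values]

# Made-up capacities for three courses.
z_capacity = standardize([18.0, 27.0, 36.0])
```

Standardizing matters for k-means because formula (1) uses squared Euclidean distance, so a variable measured on a large scale would otherwise dominate the clustering.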

Each cluster contains 25 to 36 golf courses. Next, Figs. 2, 3, 4 and 5 show the component ratios of price range, course type, self-play, and hotel in each cluster.

Fig. 2. Component ratio of price range

Fig. 3. Component ratio of course type

Fig. 4. Component ratio of self-play

Fig. 5. Component ratio of hotel

  • Figure 2 shows the component ratio of price range. Clusters 2 and 4 include many low-priced courses, and cluster 3 includes many high-priced courses.

  • Figure 3 shows the component ratio of course type. Only cluster 1 includes mountain, forest, and riverbed courses.

  • Figure 4 shows the component ratio of self-play. Only cluster 3 includes no self-play courses.

  • Figure 5 shows the component ratio of hotel availability. Only in cluster 2 do all golf courses have a hotel, and only in cluster 4 do no golf courses have a hotel.

The features of the four clusters are shown in Table 3.

Table 3. The features of each cluster

Each cluster has several distinguishing features. The variables affecting the classification are price range, course type, self-play, hotel, capacity, review score, and number of reviews. From the features of the golf courses (Table 3), we defined the characteristics of each cluster as shown in Table 4.

Table 4. The characteristics of each cluster

4.2 Identification of Cancellation Factors for Each Characteristic of Golf Courses

Second, we performed logistic regression analysis on each cluster using reservation and user data to identify the cancellation factors for each characteristic of golf courses. The data used for logistic regression analysis are shown in Table 5.

Table 5. Data used for logistic regression analysis

The cancellation rate is almost the same in every cluster. The variables used for logistic regression analysis are shown in Table 6.

Table 6. Variables used for logistic regression analysis

We selected variables by the stepwise method for each model and checked accuracy by 10-fold cross-validation. The results of the logistic regression analysis are shown in Tables 7, 8, 9 and 10 and Figs. 6, 7, 8 and 9. In Tables 7, 8, 9 and 10, the symbols ***, **, * and · indicate that the coefficients are statistically different from zero at the 0.1, 1, 5 and 10 percent levels, respectively.
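
The 10-fold cross-validation check can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the function names are hypothetical, accuracy is pooled over all held-out cases (the paper does not say whether it was pooled or averaged per fold), and a trivial majority-class rule stands in for the fitted logistic regression model.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle the case indices and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_accuracy(xs, ys, fit, predict, k=10, seed=0):
    """Fit on k-1 folds, predict the held-out fold, and pool the correct predictions."""
    folds = k_fold_indices(len(xs), k, seed)
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct += sum(predict(model, xs[i]) == ys[i] for i in test_idx)
    return correct / len(xs)

# Stand-in model (assumption): always predict the majority class of the training fold.
def fit_majority(train_x, train_y):
    return int(2 * sum(train_y) >= len(train_y))

def predict_majority(model, x):
    return model

# Made-up labels: 30 cancelled (1) and 70 played (0); features are unused here.
ys = [0] * 70 + [1] * 30
xs = [[0.0] for _ in ys]
accuracy = cross_validated_accuracy(xs, ys, fit_majority, predict_majority)
```

In a real run, `fit` would estimate the coefficients of formula (2) on the training folds and `predict` would convert the predicted probability into a class label using a chosen cutoff.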

Table 7. Result of logistic regression analysis (cluster 1)

Table 8. Result of logistic regression analysis (cluster 2)

Table 9. Result of logistic regression analysis (cluster 3)

Table 10. Result of logistic regression analysis (cluster 4)

Fig. 6. Parameter \( \beta_{k} \) of logistic regression analysis (cluster 1)

Fig. 7. Parameter \( \beta_{k} \) of logistic regression analysis (cluster 2)

Fig. 8. Parameter \( \beta_{k} \) of logistic regression analysis (cluster 3)

Fig. 9. Parameter \( \beta_{k} \) of logistic regression analysis (cluster 4)

Cluster 1: Unique Courses

Table 7 shows the result, and Fig. 6 shows the values of the parameters \( \beta_{k} \) for cluster 1. Seven variables were selected by the stepwise method in R. The main variables affecting cancellation are Play, Reservation, Holiday, Score and Start_8. The accuracy under 10-fold cross-validation is 61.5%.

Cluster 2: Public Courses Being Able to Stay

Table 8 shows the result, and Fig. 7 shows the values of the parameters \( \beta_{k} \) for cluster 2. Eight variables were selected by the stepwise method in R. The main variables affecting cancellation are Generation_70, Reservation, Play, Holiday, Mail and Registration. The accuracy under 10-fold cross-validation is 63.2%.

Cluster 3: High Class and Popular Courses

Table 9 shows the result, and Fig. 8 shows the values of the parameters \( \beta_{k} \) for cluster 3. Seven variables were selected by the stepwise method in R. The main variables affecting cancellation are Reservation, Play, Gender, Season and Holiday. The accuracy under 10-fold cross-validation is 61.9%.

Cluster 4: Public Courses Being Unable to Stay

Table 10 shows the result, and Fig. 9 shows the values of the parameters \( \beta_{k} \) for cluster 4. Eight variables were selected by the stepwise method in R. The main variables affecting cancellation are Reservation, Play, Price, Gender and Season. The accuracy under 10-fold cross-validation is 63.8%.

The characteristic variables of each model are shown in Table 11, where (+) denotes a positive regression coefficient and (−) denotes a negative regression coefficient.

Table 11. The characteristic variables of each model
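
The meaning of the coefficient signs can be read off formula (2). As a short worked step (a standard property of logistic regression, not stated explicitly in the source), rearranging (2) gives the odds of cancellation:

$$ \frac{p}{1 - p} = \exp \{ \beta_{0} + \beta_{1} x_{1} + \cdots + \beta_{k} x_{k} \} $$

Increasing one explanatory variable \( x_{m} \) by one unit while holding the others fixed therefore multiplies the odds by \( \exp \{ \beta_{m} \} \), where the index \( m \) is introduced here only for illustration. A positive coefficient raises the probability of cancellation, and a negative coefficient lowers it.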

The accuracy of each model is shown in Table 12.

Table 12. The accuracy of each model

5 Discussions

The cancellation factors common to all clusters are users who reserve frequently and users who reserve early. Users who are accustomed to using reservation sites are thought to feel less resistance to cancelling. In addition, because early reservation is the largest cancellation factor, measures that discourage these users from cancelling are expected to reduce cancellations.

The cancellation factors of cluster 1 (unique courses) are users who do not use the score management service and reservations with an early start time. Cluster 1 contains golf courses with small capacity and course types other than hills. Moreover, because the dispersion of review scores is large, many of them may be golf courses on which user preferences are divided. For unique courses, users who consider conditions such as start time more important than course condition are thought to be more likely to cancel.

The cancellation factor of cluster 2 (public courses being able to stay) is young users. Cluster 2 contains golf courses with large capacity and hotel facilities. Moreover, because of their low prices, many of them may be public courses. For public courses being able to stay, young users, whose schedules are more likely to be taken up by commitments such as business, are thought to cancel more often than older users, whose schedules are less likely to be.

The cancellation factors of cluster 3 (high class and popular courses) are female users and reservations outside the high season. Cluster 3 contains golf courses with high prices that do not offer self-play, that is, play without caddie service. Moreover, because of their high review scores, many of them may be popular courses. For popular courses, reservations in the high season are thought to be less likely to be cancelled because reservations tend to fill up.

The cancellation factor of cluster 4 (public courses being unable to stay) is reservations with a high play price. Cluster 4 contains golf courses with many reviews and without a hotel on site. Moreover, because of their low prices, many of them may be public courses. For public courses being unable to stay, many users are thought to want to keep the play price low.

6 Conclusion

In this study, we sought to identify cancellation factors based on the characteristics of golf courses on a reservation site. First, we classified golf courses using causal and review data, on the assumption that user behavior at the time of cancellation differs according to the characteristics of golf courses. As a result, golf courses were classified into four clusters, and we defined the characteristics of the golf courses in each cluster. Second, to identify the cancellation factors for each characteristic of golf courses, we performed logistic regression analysis on each cluster using reservation data and user data. Through these analyses, we identified both cancellation factors common to all clusters and cancellation factors specific to each characteristic of golf courses.

However, the accuracy of each model was not high. Constructing better models is therefore left for future work; for example, devising new explanatory variables and applying other analysis methods could be considered.