A novel approach for discovering retail knowledge with price information from transaction databases

doi:10.1016/j.eswa.2007.03.006

Expert Systems with Applications

Volume 34, Issue 4, May 2008, Pages 2350-2359


Abstract

With the advances in information technology and the emergence of Internet commerce, analysis of transaction data has become a crucial technique for effective decision-making and strategy formation in business operations. It is especially critical for retail management, in both online and brick-and-mortar stores. Traditional research in mining retail knowledge, however, does not take into account the products’ prices and how such settings can affect potential demand. This paper opens a new research dimension by treating products’ prices as an important decision variable in mining retail knowledge. To the best of our knowledge, the problem addressed in this paper has never been dealt with in existing research papers. We propose a representation scheme to incorporate price information into historical transaction data. An efficient algorithm is developed to “dig” out implicit, yet meaningful, patterns with price information. In addition, an extensive and well-designed experiment is executed, showing that the algorithm is computationally efficient and that the proposed analysis is significant and useful.

Introduction

Data mining extracts implicit, previously unknown, and potentially useful information from databases. According to the classification scheme proposed in Chen, Han, and Yu (1996), major approaches to data mining include mining association patterns, clustering, classification, mining sequential patterns, data generalization and summarization, and traversal pattern analysis. Among them, mining association patterns is probably the most popular because of its widespread applications. This approach was first introduced by Agrawal et al. (1993) and Agrawal and Srikant (1994), and can be stated as follows.

Given a database of sales transactions, an association pattern, denoted as X, is a set of items that frequently co-occur in databases. To find association patterns from databases, we first need to calculate the support of itemset X, where the support of X is the percentage of transactions in the database containing X. If its support is higher than the user-specified minimum support (minsup), we claim that itemset X is frequent. Otherwise, it is infrequent.
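The support definition above can be illustrated with a minimal Python sketch. The function names and the toy transactions are hypothetical and are not taken from the paper; the strict comparison follows the wording "higher than the user-specified minimum support".

```python
from typing import FrozenSet, Iterable, Sequence


def support(itemset: Iterable[str], transactions: Sequence[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    target = frozenset(itemset)
    # A transaction counts when the itemset is a subset of it.
    hits = sum(1 for basket in transactions if target <= basket)
    return hits / len(transactions)


def is_frequent(itemset: Iterable[str],
                transactions: Sequence[FrozenSet[str]],
                minsup: float) -> bool:
    """An itemset is frequent when its support is higher than minsup."""
    return support(itemset, transactions) > minsup


# Toy data, invented for illustration only.
transactions = [
    frozenset({"milk", "cheese", "bread"}),
    frozenset({"milk", "cheese"}),
    frozenset({"milk"}),
    frozenset({"bread"}),
]

milk_cheese_support = support({"milk", "cheese"}, transactions)
milk_cheese_frequent = is_frequent({"milk", "cheese"}, transactions, minsup=0.4)
```

In this toy database, {milk, cheese} appears in two of the four transactions, so its support is 50% and it is frequent at a minsup of 40%.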

Since association patterns are useful and easy to understand, they have been used in many successful business applications, including finance, telecommunications, marketing, recommendation, retailing, and web analysis (Bose and Mahapatra, 2001, Changchien and Lu, 2001, Chen et al., 2005, Lee et al., 2001, Lin et al., 2003, Wang and Shao, 2004). The method has also attracted increased research interest, and many extensions have been proposed in recent years, including (1) algorithm improvements (Brin et al., 1997, Chen and Ho, 2005, Han et al., 2000, Rastogi and Shim, 2002), (2) fuzzy patterns (Chen and Huang, 2005, Kuok et al., 1998), (3) multi-level patterns (Clementini et al., 2000, Han and Fu, 1999), (4) quantitative association patterns (Park et al., 1997, Srikant and Agrawal, 1996, Hsu et al., 2004), (5) spatial association patterns (Clementini et al., 2000, Koperski and Han, 1995), (6) inter-transaction patterns (Lu, Feng, & Han, 2000), (7) interesting association patterns (Bayardo and Agrawal, 1999, Freitas, 1999), and (8) temporal association patterns (Ale and Rossi, 2000, Chen et al., 2003, Li et al., 2001, Roddick and Spiliopoulou, 2002). Chen et al. (1996) and Han and Kamber (2006) give brief literature reviews of association patterns.

Previous research on mining association patterns in transaction databases usually assumed that a transaction is formed from a set of items bought in that transaction (Agrawal et al., 1993, Agrawal and Srikant, 1994). In other words, the research ignored items’ quantities and prices. Although this assumption is widely used, two difficulties may arise. First, in a practical situation, a transaction records not only the purchased items but also their quantities and prices. Therefore, if we view a transaction as only a set of items, a large portion of the stored data is unused. Second, the association patterns found in conventional transaction databases only indicate whether items are related; they do not tell us their quantity and/or price relationships. Without the quantity and price information, it is difficult to design a competitive package for sales promotions because we do not know how the prices and quantities of items influence one another. For example, we may have an association pattern, such as {milk, cheese}. This pattern would be more informative if it were more specific, such as {milk with high price, cheese with medium price}. The former only indicates that these two items are frequently bought together, but the latter tells us that this association happens when milk is at a high price and cheese at a medium price.

Some readers may wonder why we did not simply view price and quantity as numerical attributes and use methods for mining quantitative association patterns to deal with them (Hsu et al., 2004, Park et al., 1997, Srikant and Agrawal, 1996). Using this method, we can partition the prices or quantities of milk and cheese into multiple intervals. For example, we can partition price into five levels, where these five levels represent the list price, 0–10% off the list price, 10–20% off the list price, 20–30% off the list price, and more than 30% off the list price. Consequently, we can have a pattern like {milk with price level 2, cheese with price level 3}.
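The five-level partition described above could be coded roughly as follows. This is only a sketch: the function name is hypothetical, and the text does not say which level a boundary discount (exactly 10%, 20%, or 30%) belongs to, so assigning it to the shallower level is an assumption. Treating any price at or above the list price as level 1 is likewise an assumption.

```python
def price_level(list_price: float, sale_price: float) -> int:
    """Map a sale price to one of five price levels by its discount from list price.

    Level 1: list price (no discount)
    Level 2: up to 10% off
    Level 3: more than 10% and up to 20% off
    Level 4: more than 20% and up to 30% off
    Level 5: more than 30% off
    Boundary handling is an assumption; the source text leaves it open.
    """
    if list_price <= 0:
        raise ValueError("list_price must be positive")
    discount = (list_price - sale_price) / list_price
    if discount <= 0:
        return 1
    if discount <= 0.10:
        return 2
    if discount <= 0.20:
        return 3
    if discount <= 0.30:
        return 4
    return 5


# Illustrative prices only.
levels = [price_level(100.0, p) for p in (100.0, 95.0, 85.0, 75.0, 60.0)]
```

With a list price of 100, the sale prices 100, 95, 85, 75, and 60 fall into levels 1 through 5 in that order.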

Although this idea seems reasonable, it raises the following problem. Suppose the time span of the database is 12 months, and milk and cheese are at price levels 2 and 3, respectively, only in June and July. Further assume that there are 200,000 total transactions in the database, and 50,000 of those transactions occurred in June and July. If there were 2000 transactions in June and July containing milk and cheese, what is the support of the pattern {milk with price level 2, cheese with price level 3}? The answer should be 2000/50,000 = 4%, rather than 2000/200,000 = 1%. This is because [June, July] is the only period when this pattern can possibly occur, so the support should be computed over the transactions of these two months rather than those of the entire year.
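The arithmetic in this example can be checked with a few lines of Python. The counts are the ones given in the text; the helper name and the constant names are hypothetical.

```python
def support_over(matching: int, base: int) -> float:
    """Share of `base` transactions that contain the pattern."""
    if base <= 0:
        raise ValueError("base must be positive")
    return matching / base


TOTAL_TRANSACTIONS = 200_000      # whole 12-month database
JUNE_JULY_TRANSACTIONS = 50_000   # only period where the price labels hold
MATCHING_TRANSACTIONS = 2_000     # June/July transactions with milk and cheese

whole_year = support_over(MATCHING_TRANSACTIONS, TOTAL_TRANSACTIONS)
june_july = support_over(MATCHING_TRANSACTIONS, JUNE_JULY_TRANSACTIONS)

print(f"support over the whole year: {whole_year:.0%}")
print(f"support over June and July:  {june_july:.0%}")
```

Restricting the base to the two months in which the price combination can occur raises the measured support from 1% to 4%.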

The above discussion reveals that we need a new method of defining the supports of patterns with price information. Accordingly, a new type of support, called local support, is proposed to measure the frequency of itemsets with price labels. There are still other problems, however, that we may encounter. Suppose we have three combinations of milk and cheese prices as follows: {milk: p-level 2, cheese: p-level 3}, {milk: p-level 2, cheese: p-level 1}, and {milk: p-level 3, cheese: p-level 3}. Suppose their local supports are 4%, 8%, and 2%, respectively. It is a reasonable conjecture that {milk: p-level 2, cheese: p-level 1} may be good for sales and {milk: p-level 3, cheese: p-level 3} may not, because the former seems to increase sales while the latter decreases sales. For good or bad, they both represent important information that deserves further analysis. This simple example illustrates that we need a new way to define important patterns. In the past, a pattern with high frequency was deemed important. What we are interested in now, however, are those patterns with frequencies deviating substantially from the average, either larger or smaller.

In addition to price information, we have to include other important information in the patterns, such as quantity information. Therefore, we define the new patterns with price and quantity information, such as {milk: p-level 2, average-qty 2.3; cheese: p-level 1, average-qty 3.5}. This means that when the price levels of milk and cheese are 2 and 1, respectively, the average quantities of milk and cheese in transactions are 2.3 and 3.5, respectively. By comparing all the patterns with the same product combination, we can understand how items’ prices influence one another and how items’ prices influence quantities. For example, assume that we have the following patterns:

{milk: p-level 2, average-qty 2.4; cheese: p-level 3, average-qty 3.3}, sup = 4%
{milk: p-level 2, average-qty 2.3; cheese: p-level 1, average-qty 3.5}, sup = 8%
{milk: p-level 1, average-qty 1.4; cheese: p-level 1, average-qty 2.0}, sup = 2%

Using the first pattern as the basis for comparison, we find that the price levels set in the second pattern may increase the frequency of purchase (the support increases from 4% to 8%), but do not materially change the average purchasing quantity. On the other hand, the price levels in the third pattern decrease both the purchasing frequency and the average purchasing quantity.
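To make the shape of such patterns concrete, the sketch below derives a support value and average quantities for one price-labelled itemset from raw data. It is an illustration under stated assumptions, not the paper's algorithm: the data layout, all names, and the toy values are hypothetical. The support is computed only over periods in which the price labels hold, and quantities are averaged over the transactions in those periods that contain every item of the pattern.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (period, {item: quantity bought}).
Transaction = Tuple[str, Dict[str, int]]


def p_itemset_stats(price_labels: Dict[str, int],
                    transactions: List[Transaction],
                    levels_by_period: Dict[str, Dict[str, int]]):
    """Return (support, {item: average quantity}) for a price-labelled itemset."""
    # Periods in which every item is at the price level the pattern requires.
    valid = {
        period for period, levels in levels_by_period.items()
        if all(levels.get(item) == lvl for item, lvl in price_labels.items())
    }
    base = [basket for period, basket in transactions if period in valid]
    hits = [b for b in base if all(item in b for item in price_labels)]
    if not hits:
        return 0.0, {}
    sup = len(hits) / len(base)
    avg_qty = {item: sum(b[item] for b in hits) / len(hits) for item in price_labels}
    return sup, avg_qty


# Toy data, invented for illustration only.
levels_by_period = {
    "P1": {"milk": 2, "cheese": 1},
    "P2": {"milk": 1, "cheese": 1},
}
transactions = [
    ("P1", {"milk": 2, "cheese": 4}),
    ("P1", {"milk": 3, "cheese": 3}),
    ("P1", {"bread": 1}),
    ("P1", {"milk": 1}),
    ("P2", {"milk": 1, "cheese": 2}),
]

sup, avg_qty = p_itemset_stats({"milk": 2, "cheese": 1}, transactions, levels_by_period)
```

Here only period P1 matches the labels, two of its four transactions contain both items, and the resulting pattern reads {milk: p-level 2, average-qty 2.5; cheese: p-level 1, average-qty 3.5} with a support of 50%.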

This paper defines two new metrics to measure how interesting a pattern is. The frequency strength (FS) measures whether a pattern’s price levels affect its frequency significantly when compared to the average frequency of the same product combination. When FS is greater than 1, a larger value indicates that the pattern’s price levels are more likely to increase the frequency. When FS is smaller than 1, a smaller value indicates that the price levels are more likely to decrease the frequency. The second metric, the quantity strength (QS), measures whether a pattern’s price levels affect its quantity significantly when compared to the average purchasing quantity. If QS is greater than 1, the average quantity in transactions with that price combination is greater than the average quantity of the same product combination. If QS is smaller than 1, the average quantity is lower. With these two metrics, the interesting patterns we want to find are those whose price levels result in either a large or a small value of FS, QS, or both. After finding the interesting patterns, we ask domain experts to help explain them and find the true reasons underlying the phenomenon.
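The formal definitions of FS and QS are not reproduced in this excerpt. The sketch below therefore assumes the simplest reading consistent with the description: each metric is a ratio of the pattern's value to a baseline for the same product combination, so that 1 means "no different from average". The unweighted mean used as the baseline, the thresholds `low` and `high`, and all names are hypothetical stand-ins. The supports and milk quantities are the ones from the three example patterns in the Introduction.

```python
def strength(value: float, baseline: float) -> float:
    """Assumed form of FS and QS: the ratio of a pattern's value to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return value / baseline


def is_interesting(fs: float, qs: float, low: float = 0.5, high: float = 1.5) -> bool:
    """Flag a pattern whose FS or QS is unusually small or large.

    The cut-offs `low` and `high` are illustrative assumptions.
    """
    return fs <= low or fs >= high or qs <= low or qs >= high


# Supports and average milk quantities of the three example patterns.
supports = [0.04, 0.08, 0.02]
milk_qty = [2.4, 2.3, 1.4]

# Assumption: an unweighted mean stands in for the "average" baseline.
avg_support = sum(supports) / len(supports)
avg_milk_qty = sum(milk_qty) / len(milk_qty)

fs = [strength(s, avg_support) for s in supports]
qs_milk = [strength(q, avg_milk_qty) for q in milk_qty]
flags = [is_interesting(f, q) for f, q in zip(fs, qs_milk)]
```

Under these assumptions the second pattern has an FS above 1 and the third an FS below 1, which matches the qualitative reading given in the text; both would be flagged, while the first would not.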

The rest of the paper is organized as follows. We formally define the problem in Section 2 and propose an algorithm in Section 3. In Section 4, we discuss how to conduct a series of analyses based on the two proposed metrics, FS and QS. Performance is evaluated in Section 5 using a real dataset. Conclusions are drawn in Section 6.

Section snippets

Problem definitions

Given a transaction database D over period T = {t₁, t₂,…,t_k}, where t_i is the ith time period, let I = {i₁, i₂, …, i_m} be the set of product items included in D. In this paper, we assume that each item may have different prices over different time periods, but in one period an item has only one price. Therefore, we use P = {p₁, p₂, p₃, …, p_w} to denote a set of price levels. For example, we can divide the prices of items into five levels, such as p₁, p₂, p₃, p₄, and p₅, which may represent different

The algorithm

We propose an Apriori-like algorithm for mining all global frequent itemsets and all of their p-itemsets. The algorithm is outlined in Fig. 2.

In describing the algorithm, we use GC_k to denote the set of candidate k-itemsets, GF_k to denote the set of all global frequent k-itemsets, and LC_k to denote the set of all candidate k-p-itemsets. In the traditional Apriori algorithm, a k-item candidate itemset must be a combination of k − 1 frequent itemsets because of the anti-monotone property (Agrawal

The analysis

After executing the pq-Apriori algorithm, we can compute the values of FS and QS for every p-itemset. Based on these values, two kinds of analyses are proposed in this section. In the first analysis, we classify all p-itemsets into nine different categories by their FS and QS values, where patterns in the same category have similar FS and QS values. In the second analysis, we design a statistical test to examine if the frequency (quantity) of a frequent itemset is sensitive to prices. Here, the

Evaluation

In this section, we use a real dataset to study the performances of the Apriori and pq-Apriori algorithms. Afterwards, we present and discuss the relevant knowledge discovered from the real dataset.

Conclusion

With advances in information technology and Internet commerce, data-driven analysis techniques have become essential for decision-making and strategy formation in business operations. It is especially critical for retail management, in both online and brick-and-mortar stores. Traditional research on mining retail knowledge focuses on assortment planning, demand correlation analysis, and customers’ shopping behavior analysis. It does not take into account the prices of products, and how price

Acknowledgements

The first author was supported in part by the MOE Program for Promoting Academic Excellence of Universities under Grant Number 91-H-FA07-1-4 and The National Science Foundation Grant Number 91-2416-H-008-003.

References (29)

I. Bose et al. Business data mining – a machine learning perspective. Information and Management (2001)
S.W. Changchien et al. Mining association rules procedure to support on-line recommendation by customers and products fragmentation. Expert Systems with Applications (2001)
Y.L. Chen et al. Discovering time-interval sequential patterns in sequence databases. Expert Systems with Applications (2003)
Y.L. Chen et al. Market basket analysis in a multiple store environment. Decision Support Systems (2005)
E. Clementini et al. Mining multiple-level spatial association rules for objects with a broad boundary. Data and Knowledge Engineering (2000)
A. Freitas. On rule interestingness measures. Knowledge-Based Systems (1999)
P.Y. Hsu et al. Algorithms for mining association rules in bag databases. Information Sciences (2004)
C.H. Lee et al. Web personalization expert with combining collaborative filtering and association rule mining technique. Expert Systems with Applications (2001)
Q.Y. Lin et al. Mining Inter-organizational retailing knowledge for an alliance formed by competitive firms. Information and Management (2003)
F.H. Wang et al. Effective personalized recommendation based on time-framed navigation clustering and association mining. Expert Systems with Applications (2004)

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th VLDB...

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In...

Ale, J. M., & Rossi, G. H. (2000). An approach to discovering temporal association rules. In Proceedings of the 2000...

Bayardo Jr., R. J., & Agrawal, R. (1999). Mining the most interesting rules. In Proceedings of the 5th ACM SIGKDD...

