A novel approach for discovering retail knowledge with price information from transaction databases
Introduction
Data mining extracts implicit, previously unknown, and potentially useful information from databases. According to the classification scheme proposed in Chen, Han, and Yu (1996), major approaches to data mining include mining association patterns, clustering, classification, mining sequential patterns, data generalization and summarization, and traversal pattern analysis. Among them, mining association patterns is probably the most popular because of its widespread applications. This approach was first introduced by Agrawal et al. (1993) and Agrawal and Srikant (1994), and can be stated as follows.
Given a database of sales transactions, an association pattern, denoted as X, is a set of items that frequently co-occur in databases. To find association patterns from databases, we first need to calculate the support of itemset X, where the support of X is the percentage of transactions in the database containing X. If its support is higher than the user-specified minimum support (minsup), we claim that itemset X is frequent. Otherwise, it is infrequent.
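The support definition above can be illustrated with a minimal sketch. The function names and the toy transactions are hypothetical and are not taken from the paper; the strict comparison with minsup follows the wording "higher than" used above.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def is_frequent(itemset: Iterable[str],
                transactions: List[FrozenSet[str]],
                minsup: float) -> bool:
    """An itemset is frequent when its support is higher than minsup."""
    return support(itemset, transactions) > minsup


# Hypothetical toy database of four transactions.
toy_db = [
    frozenset({"milk", "cheese", "bread"}),
    frozenset({"milk", "cheese"}),
    frozenset({"milk", "bread"}),
    frozenset({"bread"}),
]

milk_cheese_support = support({"milk", "cheese"}, toy_db)    # 2 of 4 transactions
cheese_bread_support = support({"cheese", "bread"}, toy_db)  # 1 of 4 transactions
```

With a minsup of 40%, {milk, cheese} would be frequent in this toy database and {cheese, bread} would not.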
Since association patterns are useful and easy to understand, they have been applied successfully in many business domains, including finance, telecommunications, marketing, recommendation, retailing, and web analysis (Bose and Mahapatra, 2001, Changchien and Lu, 2001, Chen et al., 2005, Lee et al., 2001, Lin et al., 2003, Wang and Shao, 2004). The method has also attracted increased research interest, and many extensions have been proposed in recent years, including (1) algorithm improvements (Brin et al., 1997, Chen and Ho, 2005, Han et al., 2000, Rastogi and Shim, 2002), (2) fuzzy patterns (Chen and Huang, 2005, Kuok et al., 1998), (3) multi-level patterns (Clementini et al., 2000, Han and Fu, 1999), (4) quantitative association patterns (Park et al., 1997, Srikant and Agrawal, 1996, Hsu et al., 2004), (5) spatial association patterns (Clementini et al., 2000, Koperski and Han, 1995), (6) inter-transaction patterns (Lu, Feng, & Han, 2000), (7) interesting association patterns (Bayardo and Agrawal, 1999, Freitas, 1999), and (8) temporal association patterns (Ale and Rossi, 2000, Chen et al., 2003, Li et al., 2001, Roddick and Spiliopoulou, 2002). Chen et al. (1996) and Han and Kamber (2006) give brief literature reviews of association patterns.
Previous research on mining association patterns in transaction databases usually assumed that a transaction is simply the set of items bought in that transaction (Agrawal et al., 1993, Agrawal and Srikant, 1994). In other words, the research ignored items’ quantities and prices. Although this assumption is widely used, two difficulties may arise. First, in practice, a transaction records not only the purchased items but also their quantities and prices. Therefore, if we view a transaction as only a set of items, a large portion of the stored data goes unused. Second, the association patterns found in conventional transaction databases only indicate whether items are related; they do not tell us how their quantities and prices are related. Without the quantity and price information, it is difficult to design a competitive package for sales promotions because we do not know how the prices and quantities of items influence one another. For example, we may have an association pattern such as {milk, cheese}. This pattern would be more informative if it were more specific, such as {milk with high price, cheese with medium price}. The former only indicates that these two items are frequently bought together, but the latter tells us that this association happens when milk is at a high price and cheese at a medium price.
Some readers may wonder why we did not simply view price and quantity as numerical attributes and use methods for mining quantitative association patterns to deal with them (Hsu et al., 2004, Park et al., 1997, Srikant and Agrawal, 1996). Using this method, we can partition the prices or quantities of milk and cheese into multiple intervals. For example, we can partition price into five levels, where these five levels represent the list price, 0–10% off the list price, 10–20% off the list price, 20–30% off the list price, and more than 30% off the list price. Consequently, we can have a pattern like {milk with price level 2, cheese with price level 3}.
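As a rough sketch of the five-level partition described above, the following function maps a sale price to a price level by its discount from the list price. The function name is hypothetical, and the treatment of boundary values (for example, whether exactly 10% off belongs to level 2 or level 3) is an assumption, since the text does not specify it.

```python
def price_level(list_price: float, sale_price: float) -> int:
    """Map a sale price to one of five discount-based price levels.

    Level 1: list price (no discount)
    Level 2: up to 10% off
    Level 3: more than 10% and up to 20% off
    Level 4: more than 20% and up to 30% off
    Level 5: more than 30% off
    Upper boundaries are treated as inclusive; this is an assumption.
    """
    if list_price <= 0:
        raise ValueError("list_price must be positive")
    discount = (list_price - sale_price) / list_price
    if discount <= 0:      # at (or, by assumption, above) the list price
        return 1
    if discount <= 0.10:
        return 2
    if discount <= 0.20:
        return 3
    if discount <= 0.30:
        return 4
    return 5


# Hypothetical prices: milk listed at 100, sold at 95 -> level 2;
# cheese listed at 100, sold at 85 -> level 3.
milk_level = price_level(100.0, 95.0)
cheese_level = price_level(100.0, 85.0)
```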
Although this idea seems reasonable, it may result in the following problem. Suppose the time span of the database is 12 months, and June and July are the only months in which the prices of milk and cheese are at levels 2 and 3, respectively. Further assume that there are 200,000 transactions in the database, of which 50,000 occurred in June and July. If 2000 transactions in June and July contain both milk and cheese, what is the support of the pattern {milk with price level 2, cheese with price level 3}? The answer should be 2000/50,000 = 4%, rather than 2000/200,000 = 1%. This is because [June, July] is the only period when this pattern can possibly occur, so the support should be computed against the transactions in these two months rather than those of the entire year.
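The arithmetic in this example can be checked with a few lines of code. The function names are hypothetical; the numbers are the ones given in the example.

```python
def whole_database_support(matching: int, all_transactions: int) -> float:
    """Support computed against every transaction in the database."""
    return matching / all_transactions


def period_restricted_support(matching: int, transactions_in_period: int) -> float:
    """Support computed only against transactions from the periods in which
    the price combination is actually in effect."""
    return matching / transactions_in_period


TOTAL_TRANSACTIONS = 200_000      # whole 12-month database
JUNE_JULY_TRANSACTIONS = 50_000   # transactions in the only valid periods
MATCHING_TRANSACTIONS = 2_000     # contain milk and cheese in June and July

whole_db = whole_database_support(MATCHING_TRANSACTIONS, TOTAL_TRANSACTIONS)
restricted = period_restricted_support(MATCHING_TRANSACTIONS, JUNE_JULY_TRANSACTIONS)

print(f"whole database: {whole_db:.0%}")        # 1%
print(f"June and July only: {restricted:.0%}")  # 4%
```

The fourfold gap between the two values arises only because three quarters of the transactions fall in months where the pattern cannot occur at all.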
The above discussion reveals that we need a new method of defining the supports of patterns with price information. Accordingly, a new type of support, called local support, is proposed to measure the frequency of itemsets with price labels. There are still other problems, however, that we may encounter. Suppose we have three combinations of milk and cheese prices as follows: {milk: p-level 2, cheese: p-level 3}, {milk: p-level 2, cheese: p-level 1}, and {milk: p-level 3, cheese: p-level 3}. Suppose their local supports are 4%, 8%, and 2%, respectively. It is a reasonable conjecture that {milk: p-level 2, cheese: p-level 1} may be good for sales and {milk: p-level 3, cheese: p-level 3} may not, because the former seems to increase sales while the latter decreases sales. For good or bad, they both represent important information that deserves further analysis. This simple example illustrates that we need a new way to define important patterns. In the past, a pattern with high frequency was deemed important. What we are interested in now, however, are those patterns with frequencies deviating substantially from the average, either larger or smaller.
In addition to price information, we have to include other important information in the patterns, such as quantity information. Therefore, we define the new patterns with price and quantity information, such as {milk: p-level 2, average-qty 2.3; cheese: p-level 1, average-qty 3.5}. This means that when the price levels of milk and cheese are 2 and 1, respectively, the average quantities of milk and cheese in transactions are 2.3 and 3.5, respectively. By comparing all the patterns with the same product combination, we can understand how items’ prices influence one another and how items’ prices influence quantities. For example, assume that we have the following patterns:
{milk: p-level 2, average-qty 2.4; cheese: p-level 3, average-qty 3.3}, sup = 4%
{milk: p-level 2, average-qty 2.3; cheese: p-level 1, average-qty 3.5}, sup = 8%
{milk: p-level 1, average-qty 1.4; cheese: p-level 1, average-qty 2.0}, sup = 2%
Using the first pattern as the basis for comparison, we find that the price levels in the second pattern may increase the frequency of purchase (the support increases from 4% to 8%) but leave the average purchasing quantities almost unchanged. The price levels in the third pattern, on the other hand, decrease both the purchasing frequency and the average purchasing quantities.
This paper defines two new metrics to measure how interesting a pattern is. The frequency strength (FS) measures whether a pattern’s price levels significantly affect its frequency relative to the average frequency of the same product combination. When FS is greater than 1, a larger value indicates that the pattern’s price levels are more likely to increase purchase frequency; when FS is smaller than 1, a smaller value indicates that the price levels are more likely to decrease it. The second metric, the quantity strength (QS), measures whether a pattern’s price levels significantly affect its purchase quantity relative to the average purchasing quantity. If QS is greater than 1, the average quantity in transactions with that price combination is greater than the average quantity for the same product combination; if QS is smaller than 1, it is lower. With these two metrics, the interesting patterns are those whose price levels yield an unusually large or unusually small value of FS, QS, or both. After finding the interesting patterns, we ask domain experts to help explain them and identify the reasons underlying the phenomena.
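The ratio-to-average idea behind FS and QS can be sketched as follows, using the three milk and cheese patterns listed earlier. This is only an illustration under stated assumptions: the formal definitions are given in Section 2 and are not reproduced here, so the baseline used below (an unweighted mean over the three listed patterns) is an assumption, as are all function and variable names.

```python
from statistics import mean


def strength(value: float, baseline: float) -> float:
    """Ratio of a pattern-specific value to the baseline for the same
    product combination; above 1 means higher than the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return value / baseline


# (support, average milk quantity, average cheese quantity), copied from
# the three example patterns in the text.
patterns = {
    ("milk:p2", "cheese:p3"): (0.04, 2.4, 3.3),
    ("milk:p2", "cheese:p1"): (0.08, 2.3, 3.5),
    ("milk:p1", "cheese:p1"): (0.02, 1.4, 2.0),
}

# ASSUMPTION: the baseline is the unweighted mean over the listed patterns.
# The paper's own baseline may be defined differently.
avg_sup = mean(sup for sup, _, _ in patterns.values())
avg_milk = mean(milk for _, milk, _ in patterns.values())
avg_cheese = mean(cheese for _, _, cheese in patterns.values())

scores = {
    key: {
        "FS": strength(sup, avg_sup),
        "QS_milk": strength(milk, avg_milk),
        "QS_cheese": strength(cheese, avg_cheese),
    }
    for key, (sup, milk, cheese) in patterns.items()
}

for key, values in scores.items():
    print(key, {name: round(v, 2) for name, v in values.items()})
```

Under this reading, the second pattern has an FS above 1 and the third has both FS and QS below 1, which matches the qualitative comparison made in the text.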
The rest of the paper is organized as follows. We formally define the problem in Section 2 and propose an algorithm in Section 3. In Section 4, we discuss how to conduct a series of analyses based on the two proposed metrics, FS and QS. The performance evaluation is performed in Section 5 using a real dataset. Conclusions are drawn in Section 6.
Problem definitions
Given a transaction database D over period T = {t1, t2,…,tk}, where ti is the ith time period, let I = {i1, i2, …, im} be the set of product items included in D. In this paper, we assume that each item may have different prices over different time periods, but in one period an item has only one price. Therefore, we use P = {p1, p2, p3, …, pw} to denote a set of price levels. For example, we can divide the prices of items into five levels, such as p1, p2, p3, p4, and p5, which may represent different
The algorithm
We propose an Apriori-like algorithm for mining all global frequent itemsets and all of their p-itemsets. The algorithm is outlined in Fig. 2.
In describing the algorithm, we use GCk to denote the set of candidate k-itemsets, GFk to denote the set of all global frequent k-itemsets, and LCk to denote the set of all candidate k-p-itemsets. In the traditional Apriori algorithm, a k-item candidate itemset must be a combination of k − 1 frequent itemsets because of the anti-monotone property (Agrawal
The analysis
After executing the pq-Apriori algorithm, we can compute the values of FS and QS for every p-itemset. Based on these values, two kinds of analyses are proposed in this section. In the first analysis, we classify all p-itemsets into nine different categories by their FS and QS values, where patterns in the same category have similar FS and QS values. In the second analysis, we design a statistical test to examine if the frequency (quantity) of a frequent itemset is sensitive to prices. Here, the
Evaluation
In this section, we use a real dataset to study the performance of the Apriori and pq-Apriori algorithms. Afterwards, we present and discuss the relevant knowledge discovered from the real dataset.
Conclusion
With advances in information technology and Internet commerce, data-driven analysis techniques have become essential for decision-making and strategy formation in business operations. It is especially critical for retail management, in both online and brick-and-mortar stores. Traditional research on mining retail knowledge focuses on assortment planning, demand correlation analysis, and customers’ shopping behavior analysis. It does not take into account the prices of products, and how price
Acknowledgements
The first author was supported in part by the MOE Program for Promoting Academic Excellence of Universities under Grant Number 91-H-FA07-1-4 and The National Science Foundation Grant Number 91-2416-H-008-003.
References (29)
- et al. (2001). Business data mining – a machine learning perspective. Information and Management.
- et al. (2001). Mining association rules procedure to support on-line recommendation by customers and products fragmentation. Expert Systems with Applications.
- et al. (2003). Discovering time-interval sequential patterns in sequence databases. Expert Systems with Applications.
- et al. (2005). Market basket analysis in a multiple store environment. Decision Support Systems.
- et al. (2000). Mining multiple-level spatial association rules for objects with a broad boundary. Data and Knowledge Engineering.
- (1999). On rule interestingness measures. Knowledge-Based Systems.
- et al. (2004). Algorithms for mining association rules in bag databases. Information Sciences.
- et al. (2001). Web personalization expert with combining collaborative filtering and association rule mining technique. Expert Systems with Applications.
- et al. (2003). Mining Inter-organizational retailing knowledge for an alliance formed by competitive firms. Information and Management.
- et al. (2004). Effective personalized recommendation based on time-framed navigation clustering and association mining. Expert Systems with Applications.