Aggregating Association Rules to Improve Change Recommendation

Empirical Software Engineering

Abstract

As the complexity of software systems grows, it becomes increasingly difficult for developers to be aware of all the dependencies that exist between artifacts (e.g., files or methods) of a system. Change recommendation has been proposed as a technique to overcome this problem, as it suggests to a developer relevant source-code artifacts related to her changes. Association rule mining has shown promise in deriving such recommendations by uncovering relevant patterns in the system’s change history. The strength of the mined association rules is captured using a variety of interestingness measures. However, state-of-the-art recommendation engines typically use only the rule with the highest interestingness value when more than one rule applies. In contrast, we argue that when multiple rules apply, this indicates collective evidence, and aggregating those rules (and their evidence) will lead to more accurate change recommendation. To investigate this hypothesis we conduct a large empirical study of 15 open source software systems and two systems from our industry partners. We evaluate association rule aggregation using four variants of the change history for each system studied, enabling us to compare two different levels of granularity in two different scenarios. Furthermore, we study 40 interestingness measures using the rules produced by two different mining algorithms. The results show that (1) between 13 and 90% of change recommendations can be improved by rule aggregation, (2) rule aggregation almost always improves change recommendation for both algorithms and all measures, and (3) fine-grained histories benefit more from rule aggregation.

Notes

  1. Other levels of granularity are possible, as our algorithms are granularity agnostic. Thus, our initial description at the file level is without loss of generality. Provided suitable co-change data, the algorithms can relate methods or variables just as well as files, a fact which will be exploited later on in the paper.

  2. The three measures are: descriptive confirmed confidence, example and counterexample rate, and least contradictions. Other measures also sometimes produced negative values, although quite rarely.

  3. Formal proofs for the three aggregator functions are provided in the Appendix.

  4. For a normally distributed population of 50 000, a minimum of 657 samples is required to attain 99% confidence with a 5% confidence interval that the sampled transactions are representative of the population. Since we do not know the distribution of transactions, we correct the sample size to the number needed for a non-parametric test to have the same ability to reject the null hypothesis. This correction is done using the Asymptotic Relative Efficiency (ARE). As AREs differ for various non-parametric tests, we choose the lowest coefficient, 0.637, yielding a conservative minimum sample size of 657/0.637 = 1032 transactions. Hence, a sample size of 1100 is more than sufficient to attain 99% confidence with a 5% confidence interval that the samples are representative of the population.

  5. Exceptions are the descriptive confirmed confidence and example and counterexample rate, where aggregation was also found to have a non-significant effect in Fig. 4.
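
The arithmetic in note 4 can be reproduced with a short sketch. It assumes the standard sample-size formula for a proportion with a finite-population correction, a worst-case proportion of 0.5, and the rounded critical value z = 2.58 for 99% confidence. The note does not state which formula was used, so the function name `cochran_sample_size` and these parameter choices are illustrative assumptions, not the authors' procedure.

```python
import math

def cochran_sample_size(population, z, margin, p=0.5):
    """Sample size for a proportion with finite-population correction (assumed formula)."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2      # size for an infinite population
    return n0 / (1 + (n0 - 1) / population)        # finite-population correction

# z = 2.58 is an assumed rounded critical value for 99% confidence.
n_parametric = math.ceil(cochran_sample_size(50_000, z=2.58, margin=0.05))

# Divide by the lowest ARE coefficient quoted in note 4 to stay conservative.
n_nonparametric = math.ceil(n_parametric / 0.637)

print(n_parametric, n_nonparametric)
```

Under these assumptions the two values match the 657 and 1032 transactions quoted in the note, both below the sample size of 1100 that was used.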

Acknowledgements

This work is supported by the Research Council of Norway through the EvolveIT project (#221751/F20) and the Certus SFI (#203461/030). Dr. Binkley is supported by NSF grant IIA-1360707 and a J. William Fulbright award.

Author information

Corresponding authors

Correspondence to Thomas Rolfsnes or Leon Moonen.

Additional information

Communicated by: Romain Robbes, Christian Bird, and Emily Hill

Appendix

A Proofs

Within this section we formally prove that DCG and HCG satisfy the properties of Definition 4. As expressed earlier, we have limited our study of aggregation functions to positive values. We leave out the proof for CG as it is simply the algebraic sum and therefore naturally satisfies all properties of Definition 4.

A.1 Proof for Discounted Cumulative Gain

Theorem 1

DCG (Definition 6) satisfies the properties in Definition 4 for non-negative values of an interestingness measure M.

Proof

Let \(M\) be an interestingness measure, \(R\) be a set of rules with non-negative interestingness values, and \(V = [v_1, \ldots, v_n]\) be an ordered list of interestingness values of rules in \(R\) for measure \(M\), such that \(\forall i < j .\; v_i \geq v_j\). Let \(r\) be an arbitrary rule in \(R\), and \(R^{\prime}\) be equal to \(R \setminus \{r\}\). Then, there exists a \(v_k \in V\) such that \(M(r) = v_k\). We define \(U = [u_1, \ldots, u_{n-1}]\) as the ordered list of interestingness values of rules in \(R^{\prime}\) for measure \(M\). \(U\) is equal to \(V\) except that \(v_k\) is removed from it, and we have \(\forall i < j .\; u_i \geq u_j\). Then:

$$ \forall\, 1 \leq i < k .\; v_{i} = u_{i} $$
(6)
$$ \forall\, k < i \leq n .\; v_{i} = u_{i-1} $$
(7)

The first property in Definition 4, which concerns \(R\)s of size one, is trivial. To prove that the second property in Definition 4 holds for DCG, we compute the difference between the DCG of \(R\) and the DCG of \(R^{\prime}\) and show that for \(v_k > 0\), this difference is positive, and for \(v_k = 0\), this difference is equal to zero. Let \(DCG(R, M)\) and \(DCG(R^{\prime}, M)\) denote the DCGs of \(R\) and \(R^{\prime}\) for the interestingness measure \(M\), respectively.

$$\begin{array}{@{}rcl@{}} DCG(R, M) - DCG(R^{\prime}, M) &=& \sum\limits_{i = 1}^{n} \frac{v_{i}}{log_{2}(i + 1)} - \sum\limits_{i = 1}^{n-1} \frac{u_{i}}{log_{2}(i + 1)} \\ &=& \sum\limits_{i = 1}^{k-1} \frac{v_{i} - v_{i}}{log_{2}(i + 1)} \\ &&+ \sum\limits_{i=k}^{n} \frac{v_{i}}{log_{2}(i + 1)} - \sum\limits_{i=k}^{n-1} \frac{u_{i}}{log_{2}(i + 1)} \\ &\overset{Eq 6,7}{=}& \sum\limits_{i=k}^{n-1} \frac{v_{i} - v_{i + 1}}{log_{2}(i + 1)} + \frac{v_{n}}{log_{2}(n + 1)} \end{array} $$
(8)

Note that both terms in the last line are always non-negative. Now, we consider two cases based on the value of \(v_k\):

Case \(v_k > 0\). Since \(v_k > 0\), the two terms in (8) cannot be zero simultaneously, because this requires \(v_n = 0\) and, at the same time, \(\forall i \geq k .\; v_i = v_{i+1}\). The latter implies \(v_n = v_k > 0\), which contradicts the former. Therefore, in this case, there is always at least one positive term in (8). Thus:

$$ DCG(R, M) - DCG(R^{\prime}, M) > 0 $$
(9)

This proves that the second property in Definition 4 holds when \(M(r) = v_k\) is positive.

Case \(v_k = 0\). In this case \(DCG(R, M) - DCG(R^{\prime}, M) = 0\). This follows from the original assumption that the values in \(V\) are non-negative and ordered: since \(v_k = 0\), every \(v_i\) with \(i \geq k\) is also zero, so all \(v_i\) appearing in the last line of (8) are equal to zero. □
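
The two cases of the proof can be checked numerically with a minimal sketch. Definition 6 is not reproduced in this section, so the function below takes the DCG formula from the sums in (8); the name `dcg` and the sample values are hypothetical.

```python
import math

def dcg(values):
    """DCG of non-negative interestingness values, following the sums in (8)."""
    ranked = sorted(values, reverse=True)          # v_1 >= v_2 >= ... >= v_n
    return sum(v / math.log2(i + 1) for i, v in enumerate(ranked, start=1))

# Hypothetical interestingness values of the rules in R.
full = dcg([0.9, 0.4, 0.4, 0.0])

# Removing a rule with a positive value (here 0.4) must lower the DCG.
without_positive = dcg([0.9, 0.4, 0.0])

# Removing a rule with value 0 must leave the DCG unchanged.
without_zero = dcg([0.9, 0.4, 0.4])

print(full > without_positive, math.isclose(full, without_zero))
```

The example deliberately includes a tie (two rules with value 0.4), which is the situation where the first sum in the last line of (8) contributes nothing and the difference comes from later terms.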

A.2 Proof for Hyper Cumulative Gain

We now prove that HCG satisfies the monotonicity properties of Definition 4 for non-negative values of an interestingness measure M. We start by introducing a new operator, and two lemmas that are used in the proof.

Definition 13 (Correlative sum)

Let \(a_1\) and \(a_2\) be real numbers. For any nonzero real number \(b\), we define the operator \(S_b\) as

$$a_{1} S_{b} a_{2} = a_{1} + \frac{b-a_{1}}{b} \cdot a_{2}. $$

Lemma 1 (Properties of correlative sum)

For any nonzero real number \(b\), the correlative sum \(S_b\) is commutative and associative.

Proof

\(S_b\) is commutative:

$$\begin{array}{@{}rcl@{}} a_{1} S_{b} a_{2} & = & a_{1} + \frac{b-a_{1}}{b} \cdot a_{2} \\ & = & \frac{a_{1} b + a_{2} b - a_{1} a_{2}}{b} \\ & = & a_{2} + \frac{b-a_{2}}{b} \cdot a_{1} \\ & = & a_{2} S_{b} a_{1} \end{array} $$

\(S_b\) is associative:

$$\begin{array}{@{}rcl@{}} (a_{1} S_{b} a_{2}) S_{b} a_{3} & = & a_{1} + \frac{b-a_{1}}{b} \cdot a_{2} + \frac{b-(a_{1} + \frac{b-a_{1}}{b} \cdot a_{2})}{b} \cdot a_{3} \\ & = & \frac{a_{1} b + a_{2} b - a_{1}a_{2}} {b} + \frac{b-(\frac{a_{1} b + a_{2} b - a_{1}a_{2}}{b})}{b} \cdot a_{3} \\ & = & \frac{a_{1} b + a_{2} b - a_{1}a_{2}}{b} + \frac{b^{2}-(a_{1} b + a_{2} b - a_{1}a_{2} )}{b^{2}} \cdot a_{3} \\ & = & \frac{a_{1} b^{2} + a_{2} b^{2} - a_{1}a_{2}b + a_{3} b^{2} - a_{1}a_{3}b - a_{2}a_{3}b + a_{1}a_{2}a_{3}} {b^{2}} \\ & = & a_{1} + \frac{b-a_{1}}{b} \cdot \frac{a_{2} b + a_{3} b - a_{2}a_{3}} {b} \\ & = & a_{1} + \frac{b-a_{1}}{b} \cdot (a_{2} + \frac{b-a_{2}}{b} \cdot a_{3}) \\ & = & a_{1} S_{b} (a_{2} S_{b} a_{3}) \end{array} $$

An important implication of this lemma is that \(S_b\) can be applied to a sequence of numbers independently of the ordering of the elements in the sequence.
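
As a small sanity check of Lemma 1, the sketch below implements the correlative sum from Definition 13 and folds it over every ordering of the same values. The helper name `corr_sum`, the bound `b = 1`, and the sample values are hypothetical; exact rationals are used so that equality is not blurred by floating-point rounding.

```python
from fractions import Fraction
from functools import reduce
from itertools import permutations

def corr_sum(a1, a2, b):
    """Correlative sum a1 S_b a2 = a1 + ((b - a1) / b) * a2, for nonzero b."""
    return a1 + (b - a1) / b * a2

b = Fraction(1)                                    # hypothetical upper bound
values = [Fraction(1, 2), Fraction(1, 4), Fraction(3, 10)]
a1, a2, a3 = values

# Associativity: both bracketings give the same result.
left = corr_sum(corr_sum(a1, a2, b), a3, b)
right = corr_sum(a1, corr_sum(a2, a3, b), b)

# Order independence: fold S_b over every permutation of the values.
results = {
    reduce(lambda x, y: corr_sum(x, y, b), order)
    for order in permutations(values)
}
print(left == right, results)
```

All six orderings collapse to the single value 59/80, which is what commutativity and associativity together predict.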

Lemma 2

For any nonzero real number \(b\), let \(L = \{l_1, l_2, \ldots, l_n\}\) be a sequence of real numbers. Let \(S_b(L)\) denote \(l_1 \, S_b \, l_2 \cdots l_{n-1} \, S_b \, l_n\). Then

$$ S_{b}(L) = l_{1} + \sum\limits_{i = 2}^{n} \left( l_{i} \cdot \prod\limits_{j = 1}^{i-1}(1-\frac{l_{j}}{b})\right) $$
(10)

and for any given real number l, we have:

$$ S_{b}(L \cup \{l\}) = S_{b}(L) + l \cdot \prod\limits_{l_{j} \in L}(1-\frac{l_{j}}{b}). $$
(11)

Proof

The proof for both parts is straightforward after expanding the polynomials. □

An implication of (11) is that, for any sequence of real numbers \(L = \{l_1, l_2, \ldots, l_n\}\), and any arbitrary real number \(l\) in \(L\), we have:

$$ S_{b}(L) - S_{b}(L \setminus \{l\}) = l \cdot \prod\limits_{l_{j} \in L \setminus \{l\}}(1-\frac{l_{j}}{b}) $$
(12)
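
The closed form (10) and the removal identity (12) can be compared against a direct fold of \(S_b\) on a concrete sequence. This is only an illustrative check, not a substitute for the proof; the function names, the bound `b = 1`, and the sample values are assumptions made for the example.

```python
from fractions import Fraction
from functools import reduce
from math import prod

def corr_sum(a1, a2, b):
    """Correlative sum from Definition 13."""
    return a1 + (b - a1) / b * a2

def s_b_fold(values, b):
    """S_b(L) computed by applying the operator left to right."""
    return reduce(lambda x, y: corr_sum(x, y, b), values)

def s_b_closed(values, b):
    """Right-hand side of (10)."""
    return values[0] + sum(
        values[i] * prod(1 - values[j] / b for j in range(i))
        for i in range(1, len(values))
    )

b = Fraction(1)
seq = [Fraction(1, 2), Fraction(1, 4), Fraction(3, 10)]
removed = seq[1]                                   # the element l taken out of L
rest = [seq[0], seq[2]]                            # L \ {l}

lhs = s_b_fold(seq, b) - s_b_fold(rest, b)         # left-hand side of (12)
rhs = removed * prod(1 - lj / b for lj in rest)    # right-hand side of (12)
print(s_b_fold(seq, b), s_b_closed(seq, b), lhs, rhs)
```

For this sequence both forms of \(S_b(L)\) evaluate to 59/80 and both sides of (12) evaluate to 7/80.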

Theorem 2

HCG (Definition 9) satisfies the properties of Definition 4 for non-negative values of an interestingness measure M.

Proof

Let \(M\) be a normalized interestingness measure with upper bound \(b \in \mathbb{R} \setminus \{0\}\), \(R\) be a set of rules \(r_1, \ldots, r_n\), and \(L_M = \langle M(r_1), \cdots, M(r_n) \rangle\) be the sequence of non-negative interestingness values for the rules in \(R\). The HCG of \(\mathscr{H}(R)\) for \(M\) defined in Definition 9 can be rewritten as follows:

$$ HCG(\mathscr{H}(R), M) = \left( S_{b}(L_{M}) , m \right) $$
(13)

where m is given by

$$m = \vert\{r \in R | M(r) > 0\}\vert $$

The first property in Definition 4, which concerns \(R\)s of size one, is again trivial. Let \(r\) be an arbitrary rule in \(R\), and \(l\) denote \(M(r)\). To prove that the second property in Definition 4 holds for HCG, we compare the HCGs of \(R\) and \(R \setminus \{r\}\), and consider two cases based on the value of \(M(r)\):

Case \(M(r) > 0\). In this case, we need to show that the HCG of \(R\) is greater than the HCG of \(R \setminus \{r\}\).

$$\begin{array}{@{}rcl@{}} HCG(\mathscr{H}(R), M) > HCG(\mathscr{H}(R \setminus \{r\}), M) & \overset{Eq 13}{\equiv}& \left( S_{b}(L_{M}) , m \right) > \left( S_{b}(L_{M} \setminus \{l\}) , m-1 \right) \\ & \overset{Def 10}{\equiv}& S_{b}(L_{M}) \geq S_{b}(L_{M} \setminus \{l\}) \end{array} $$
(14)

To show that inequality (14) holds, we show that \(S_b(L_M) - S_b(L_M \setminus \{l\})\) is non-negative, which, according to (11), is equivalent to showing that

$$l \cdot \prod\limits_{l_{j} \in L_{M} \setminus \{l\}}(1-\frac{l_{j}}{b}) \geq 0 $$

This inequality holds because \(l = M(r) > 0\), and for all \(l_j \in L_M \setminus \{l\}\) the term \(1-\frac{l_{j}}{b}\) is non-negative (because \(l_j\) is at most \(b\)). Since \(m > m - 1\), inequality (14) holds, completing the case for \(M(r) > 0\).

Case \(M(r) = 0\). In this case, we have to show that the HCGs of \(R\) and \(R \setminus \{r\}\) are equal.

$$ HCG(\mathscr{H}(R), M) = HCG(\mathscr{H}(R \setminus \{r\}), M) \overset{Eq 13}{\equiv} \left( S_{b}(L_{M}) , m \right) = \left( S_{b}(L_{M} \setminus \{l\}) , m \right) $$

To do so, we have to show that \(S_b(L_M) = S_b(L_M \setminus \{l\})\). This is proven by forming \(S_b(L_M) - S_b(L_M \setminus \{l\})\), and replacing \(l\) with zero in the right-hand side of (11). This completes the case for \(M(r) = 0\).

So far, we have proven that HCG satisfies the properties in Definition 4 for interestingness measures that have a finite upper bound. If the upper bound of an interestingness measure is infinity, then, like CG, HCG becomes the sum of the interestingness values of all rules in \(R\). Therefore, it satisfies all the properties in Definition 4. □
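
The pair form (13) can be sketched in a few lines. Definitions 9 and 10 are not reproduced in this section, so two assumptions are made here and should be read as such: the function `hcg` returns the pair \((S_b(L_M), m)\) from (13), and pairs are compared with Python's built-in lexicographic tuple ordering, which agrees with the equivalence used in (14) but is only a stand-in for Definition 10. The infinite-bound branch follows the closing remark of the proof.

```python
import math
from functools import reduce

def hcg(values, b):
    """Return (S_b(L_M), m) as in (13); values are assumed to lie in [0, b]."""
    if math.isinf(b):
        total = sum(values)                        # unbounded measure: plain sum, like CG
    else:
        # 0 is neutral for S_b, so it is a safe starting value for the fold.
        total = reduce(lambda x, y: x + (b - x) / b * y, values, 0.0)
    m = sum(1 for v in values if v > 0)            # number of rules with M(r) > 0
    return (total, m)

# Removing a rule with value 0 leaves the pair unchanged.
same = hcg([0.5, 0.25, 0.0], 1.0) == hcg([0.5, 0.25], 1.0)

# Removing a rule with a positive value yields a smaller pair.
smaller = hcg([0.5, 0.25], 1.0) > hcg([0.5], 1.0)

# When one value already equals b, S_b saturates and only m separates the pairs.
saturated = hcg([1.0, 0.5], 1.0) > hcg([1.0], 1.0)
print(same, smaller, saturated)
```

The saturated example shows why the count \(m\) is part of the pair: both sets aggregate to 1.0, and the inequality in (14) holds only because \(m > m - 1\).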

Cite this article

Rolfsnes, T., Moonen, L., Alesio, S.D. et al. Aggregating Association Rules to Improve Change Recommendation. Empir Software Eng 23, 987–1035 (2018). https://doi.org/10.1007/s10664-017-9560-y
