Skip to main content
Log in

Summarizing transactional databases with overlapped hyperrectangles

  • Published:
Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Abstract

Transactional data are ubiquitous. Several methods, including frequent itemset mining and co-clustering, have been proposed to analyze transactional databases. In this work, we propose a new research problem to succinctly summarize transactional databases. Solving this problem requires linking the high level structure of the database to a potentially huge number of frequent itemsets. We formulate this problem as a set covering problem using overlapped hyperrectangles (a concept generally regarded as tile according to some existing papers); we then prove that this problem and its several variations are NP-hard, and we further reveal its relationship with the compact representation of a directed bipartite graph. We develop an approximation algorithm Hyper which can achieve a logarithmic approximation ratio in polynomial time. We propose a pruning strategy that can significantly speed up the processing of our algorithm, and we also propose an efficient algorithm Hyper+ to further summarize the set of hyperrectangles by allowing false positive conditions. Additionally, we show that hyperrectangles generated by our algorithms can be properly visualized. A detailed study using both real and synthetic datasets shows the effectiveness and efficiency of our approaches in summarizing transactional databases.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  • Afrati FN, Gionis A, Mannila H (2004) Approximating a collection of frequent sets. In: KDD, pp 12–19

  • Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: VLDB, pp 487–499

  • Agrawal R, Borgida A, Jagadish HV (1989) Efficient management of transitive relationships in large data, knowledge bases. In: SIGMOD Conference, pp 253–262

  • Agrawal R, Imielinski T, Swami AN (1993) Mining association rules between sets of items in large databases. In: SIGMOD Conference, pp 207–216

  • Agrawal A, Mannila H, Srikant R, Toivonen H, Verkamo A (1996) Fast discovery of association rules. Adv Knowl Discov Data Min 307–308

  • Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: SIGMOD Conference, pp 94–105

  • Besson J, Robardet C, De Raedt L, Boulicaut J-F (2006) Mining bi-sets in numerical data. In: KDID, pp 11–23

  • Boulicaut J-F, Bykowski A, Rigotti C (2003) Free-sets: a condensed representation of boolean data for the approximation of frequency queries. Data Min Knowl Discov 7(1): 5–22

    Article  MathSciNet  Google Scholar 

  • Burdick D, Calimlim M, Flannick J, Gehrke J, Yiu T (2005) Mafia: a maximal frequent itemset algorithm. IEEE Trans Knowl Data Eng 17(11): 1490–1504

    Article  Google Scholar 

  • Calders T, Goethals B (2007) Non-derivable itemset mining. Data Min Knowl Discov 14(1): 171–206

    Article  MathSciNet  Google Scholar 

  • Chandola V, Kumar V (2007) Summarization—compressing data into an informative representation. Knowl Inf Syst 12(3): 355–378

    Article  Google Scholar 

  • Chvátal V. (1979) A greedy heuristic for the set-covering problem. Math Oper Res 4: 233–235

    Article  MATH  MathSciNet  Google Scholar 

  • Faloutsos C, Megalooikonomou V (2007) On data mining, compression, kolmogorov complexity. Data Min Knowl Discov 15(1): 3–20

    Article  MathSciNet  Google Scholar 

  • Gao Byron J, Ester M (2006) Turning clusters into patterns: rectangle-based discriminative data description. In: ICDM, pp 200–211

  • Gao Byron J, Ester M, Cai JY, Schulte O, Xiong H (2007) The minimum consistent subset cover problem, its applications in data mining. In: KDD, pp 310–319

  • Geerts F, Goethals B, Mielikäinen T (2004) Tiling databases. In: Discovery science, pp 278–289

  • Gionis A, Mannila H, Seppänen JK (2004) Geometric, combinatorial tiles in 0-1 data. In: PKDD, pp 173–184

  • Han J, Kamber M (2006) Data mining: concepts, techniques, second edition. Morgan Kaufmann, San Francisco

    Google Scholar 

  • Harper LH (1964) Optimal assignments of numbers to vertices. J Soc Ind Appl Math 12(1): 131–135

    Article  MATH  MathSciNet  Google Scholar 

  • Hartigan JA (1972) Direct clustering of a data matrix. J Am Stat Assoc 67(337): 123–129

    Article  Google Scholar 

  • Jin R, Xiang Y, Fuhry D, Dragan FF (2008) Overlapping matrix pattern visualization: a hypergraph approach. In: ICDM, pp 313–322 (© IEEE, 2008. doi:10.1109/ICDM.2008.102)

  • Jin R, Xiang Y, Liu L (2009) Cartesian contour: a concise representation for a collection of frequent sets. In: KDD, pp 417–426

  • Johnson D, Krishnan S, Chhugani J, Kumar S, Venkatasubramanian S (2004) Compressing large boolean matrices using reordering techniques. In: VLDB, pp 13–23

  • Kellerer H, Pferschy U, Pisinger D (2004) Knapsack problems. Springer Verlag, New York

    MATH  Google Scholar 

  • Lakshmanan LVS, Ng RT, Wang CX, Zhou X, Johnson TJ (2002) The generalized mdl approach for summarization. In: VLDB, pp 766–777

  • Li T (2005) A general model for clustering binary data. In: KDD, pp 188–197

  • Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinf 1(1): 24–45

    Article  Google Scholar 

  • Minoux M (1977) Accelerated greedy algorithms for maximizing submodular set functions. In: the 8th IFIP Conference on Optimization Techniques

  • Mirkin B (1996) Mathematical classification and clustering. Kluwer Academic Publishers, Boston

    MATH  Google Scholar 

  • Peeters R (2003) The maximum edge biclique problem is np-complete. Discret Appl Math 131(3): 651–654

    Article  MATH  MathSciNet  Google Scholar 

  • Pei J, Dong G, Zou W, Han J (2004) Mining condensed frequent-pattern bases. Knowl Inf Syst 6(5): 570–594

    Article  Google Scholar 

  • Richardson TJ, Urbanke RL (2008) Modern coding theory. Cambridge University Press, Cambridge

    Book  MATH  Google Scholar 

  • Safro I, Ron D, Brandt A (2006) Graph minimum linear arrangement by multilevel weighted edge contractions. J Algorithms 60(1): 24–41

    Article  MATH  MathSciNet  Google Scholar 

  • Siebes A, Vreeken J, van Leeuwen M (2006) Item sets that compress. In: SDM

  • Steinbach M, Tan P-N, Kumar V (2004) Support envelopes: a technique for exploring the structure of association patterns. In: KDD, pp 296–305

  • van Leeuwen M, Vreeken J, Siebes A (2006) Compression picks item sets that matter. In: PKDD, pp 585–592

  • Vreeken J, van Leeuwen M, Siebes A (2007) Characterising the difference. In: KDD, pp 765–774

  • Wang J, Karypis G (2006) On efficiently summarizing categorical databases. Knowl Inf Syst 9(1): 19–37

    Article  Google Scholar 

  • Xiang Y, Jin R, Fuhry D, Dragan FF (2008) Succinct summarization of transactional databases: an overlapped hyperrectangle scheme. In: KDD, pp 758–766 (© ACM, 2008. http://doi.acm.org/10.1145/1401890.1401981)

  • Xin D, Han J, Yan X, Cheng H (2005) Mining compressed frequent-pattern sets. In: VLDB, pp 709–720

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yang Xiang.

Additional information

Responsible editor: Bart Goethals.

A preliminary version of this article appeared in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Xiang et al. 2008). This submission has substantially extended the previous paper and contains new and major-value added contribution in comparison with the conference publication.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Xiang, Y., Jin, R., Fuhry, D. et al. Summarizing transactional databases with overlapped hyperrectangles. Data Min Knowl Disc 23, 215–251 (2011). https://doi.org/10.1007/s10618-010-0203-9

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10618-010-0203-9

Keywords

Navigation