DOI: 10.1145/1081870.1081907
Article

Summarizing itemset patterns: a profile-based approach

Published: 21 August 2005

Abstract

Frequent-pattern mining has been studied extensively, with scalable methods developed for mining various kinds of patterns including itemsets, sequences, and graphs. However, the bottleneck of frequent-pattern mining is not efficiency but interpretability, due to the huge number of patterns generated by the mining process.

In this paper, we examine how to summarize a collection of itemset patterns using only K representatives, a small number of patterns that a user can handle easily. The K representatives should not only cover most of the frequent patterns but also approximate their supports. A generative model is built to extract and profile these representatives, under which the supports of the patterns can be easily recovered without consulting the original dataset. Based on the restoration error, we propose a quality measure function to determine the optimal value of the parameter K. Polynomial-time algorithms are developed together with several optimization heuristics for efficiency improvement. Empirical studies indicate that we can obtain compact summarizations on real datasets.
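The idea of recovering supports from a profile, without going back to the dataset, can be illustrated with a small sketch. The abstract does not spell out the generative model, so the code below assumes a simple one: each profile stores per-item probabilities estimated over a group of transactions plus the fraction of the database that group covers, and a pattern's support is estimated as that fraction times the product of its item probabilities (an item-independence assumption). The restoration error is taken here to be the mean relative error between true and estimated supports. The function names `build_profile`, `estimated_support`, and `restoration_error` are hypothetical and chosen for illustration only.

```python
from math import prod


def build_profile(group, database_size):
    """Summarize a non-empty group of transactions (sets of items).

    Returns (item_prob, proportion): the per-item relative frequencies
    inside the group, and the fraction of the database the group covers.
    Assumption: this is an illustrative profile, not the paper's exact one.
    """
    items = set().union(*group)
    item_prob = {i: sum(i in t for t in group) / len(group) for i in items}
    proportion = len(group) / database_size
    return item_prob, proportion


def estimated_support(pattern, item_prob, proportion):
    """Estimate support assuming items occur independently within the group."""
    return proportion * prod(item_prob.get(i, 0.0) for i in pattern)


def restoration_error(true_supports, item_prob, proportion):
    """Mean relative error between true and estimated supports (assumed measure)."""
    errors = [
        abs(s - estimated_support(p, item_prob, proportion)) / s
        for p, s in true_supports.items()
    ]
    return sum(errors) / len(errors)


# Toy database of five transactions.
db = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"d"}]

# One profile built from the transactions that contain item "a".
group = [t for t in db if "a" in t]
item_prob, proportion = build_profile(group, len(db))

# True relative supports of the patterns the profile should summarize.
patterns = [frozenset("a"), frozenset("ab"), frozenset("ac"), frozenset("abc")]
true_supports = {p: sum(p <= t for t in db) / len(db) for p in patterns}

for p in patterns:
    est = estimated_support(p, item_prob, proportion)
    print(sorted(p), round(true_supports[p], 3), round(est, 3))
print("restoration error:", round(restoration_error(true_supports, item_prob, proportion), 5))
```

In this toy case a single profile recovers three of the four supports exactly and overestimates only the longest pattern. Under the assumptions above, repeating the computation for different numbers of profiles and comparing the resulting restoration errors is one way to reason about a suitable K.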

    Published In

    KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
    August 2005
    844 pages
    ISBN:159593135X
    DOI:10.1145/1081870

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. frequent pattern
    2. probabilistic model
    3. summarization

    Qualifiers

    • Article

    Conference

    KDD05

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

    Cited By

    • (2024) Data heterogeneity's impact on the performance of frequent itemset mining algorithms. Information Sciences: an International Journal 678:C. DOI: 10.1016/j.ins.2024.120981. Online publication date: 1-Sep-2024
    • (2022) Summarizing Sets of Related ML-Driven Recommendations for Improving File Management in Cloud Storage. Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, pages 1-11. DOI: 10.1145/3526113.3545704. Online publication date: 29-Oct-2022
    • (2022) The minimum description length principle for pattern mining: a survey. Data Mining and Knowledge Discovery 36:5, pages 1679-1727. DOI: 10.1007/s10618-022-00846-z. Online publication date: 4-Jul-2022
    • (2021) Data Exploration by Representative Region Selection. Mathematics of Operations Research 46:3, pages 970-1007. DOI: 10.1287/moor.2020.1115. Online publication date: 1-Aug-2021
    • (2021) Scalability achievements for enumerative biclustering with online partitioning: Case studies involving mixed-attribute datasets. Engineering Applications of Artificial Intelligence 100, article 104147. DOI: 10.1016/j.engappai.2020.104147. Online publication date: Apr-2021
    • (2020) Knowledge Representation Model for Bodies of Knowledge Based on Design Patterns and Hierarchical Graphs. Computing in Science & Engineering 22:2, pages 55-63. DOI: 10.1109/MCSE.2018.2875370. Online publication date: Mar-2020
    • (2020) Predicting ground vibration induced by rock blasting using a novel hybrid of neural network and itemset mining. Neural Computing and Applications. DOI: 10.1007/s00521-020-04822-w. Online publication date: 9-Mar-2020
    • (2019) Leveraging Routine Behavior and Contextually-Filtered Features for Depression Detection among College Students. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 3:3, pages 1-33. DOI: 10.1145/3351274. Online publication date: 9-Sep-2019
    • (2019) On the appropriate pattern frequentness measure and pattern generation mode. Proceedings of the 23rd International Database Applications & Engineering Symposium, pages 1-15. DOI: 10.1145/3331076.3331125. Online publication date: 10-Jun-2019
    • (2019) Entropy-based Attribute Clustering. 2019 Joint International Conference on Digital Arts, Media and Technology with ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering (ECTI DAMT-NCON), pages 230-233. DOI: 10.1109/ECTI-NCON.2019.8692247. Online publication date: Jan-2019
