DOI: 10.1145/1150402.1150452

Extracting redundancy-aware top-k patterns

Published: 20 August 2006

Abstract

Many applications need to extract a small set of frequent patterns that have not only high significance but also low redundancy. Significance is usually defined by the application context. Previous studies have concentrated either on computing the top-k significant patterns or on removing redundancy among patterns, treating the two problems separately. There is limited work on finding top-k patterns that demonstrate high significance and low redundancy simultaneously.

In this paper, we study the problem of extracting redundancy-aware top-k patterns from a large collection of frequent patterns. We first examine evaluation functions for measuring the combined significance of a pattern set and propose MMS (Maximal Marginal Significance) as the problem formulation. The problem is NP-hard. We further present a greedy algorithm that approximates the optimal solution with performance bound O(log k) (with conditions on redundancy), where k is the number of reported patterns. The direct use of redundancy-aware top-k patterns is illustrated through two real applications: disk block prefetch and document theme extraction. Our method can also be applied to processing redundancy-aware top-k queries in traditional databases.
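
The abstract does not spell out the greedy procedure, so the following is only a minimal sketch of one way such a selection could look. It assumes a scoring rule in which a candidate's marginal gain is its significance minus its largest redundancy with any pattern already selected. That rule, the function name `greedy_redundancy_aware_top_k`, and the toy `SIG` and `RED` tables are all hypothetical illustrations, not the paper's definitions.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence

def greedy_redundancy_aware_top_k(
    patterns: Sequence[str],
    significance: Callable[[str], float],
    redundancy: Callable[[str, str], float],
    k: int,
) -> List[str]:
    """Greedily pick up to k patterns, trading significance against redundancy.

    ASSUMPTION: marginal gain = significance(p) minus the maximum redundancy
    between p and any already-selected pattern. This is an illustrative rule,
    not the formulation stated in the abstract.
    """
    remaining = list(patterns)
    selected: List[str] = []
    while remaining and len(selected) < k:
        def gain(p: str) -> float:
            if not selected:
                return significance(p)
            return significance(p) - max(redundancy(p, q) for q in selected)

        best = max(remaining, key=gain)  # highest marginal gain wins
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical toy data: three patterns, with "a" and "b" highly redundant.
SIG: Dict[str, float] = {"a": 1.0, "b": 0.9, "c": 0.5}
RED: Dict[FrozenSet[str], float] = {frozenset(("a", "b")): 0.8}

def sig(p: str) -> float:
    return SIG[p]

def red(p: str, q: str) -> float:
    # Symmetric lookup; unlisted pairs are treated as non-redundant.
    return RED.get(frozenset((p, q)), 0.0)

result = greedy_redundancy_aware_top_k(list(SIG), sig, red, k=2)
print(result)  # ['a', 'c']
```

In this toy run, a plain top-2 by significance would return "a" and "b", while the redundancy-aware selection prefers "c" because "b" largely duplicates "a".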



    Published In

    KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
    August 2006
    986 pages
    ISBN: 1595933395
    DOI: 10.1145/1150402

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. pattern extraction
    2. redundancy
    3. significance

    Qualifiers

    • Article

    Conference

    KDD06

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
