Extracting redundancy-aware top-k patterns

Authors:

Jiawei Han

KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 444 - 453

https://doi.org/10.1145/1150402.1150452

Published: 20 August 2006

Abstract

Many applications need to extract a small set of frequent patterns that have not only high significance but also low redundancy. Significance is usually defined by the application context. Previous studies have concentrated on either computing the top-k significant patterns or removing redundancy among patterns, treating the two problems separately. There is limited work on finding top-k patterns that demonstrate high significance and low redundancy simultaneously.

In this paper, we study the problem of extracting redundancy-aware top-k patterns from a large collection of frequent patterns. We first examine the evaluation functions for measuring the combined significance of a pattern set and propose MMS (Maximal Marginal Significance) as the problem formulation. The problem is NP-hard. We further present a greedy algorithm that approximates the optimal solution with performance bound O(log k) (with conditions on redundancy), where k is the number of reported patterns. The direct usage of redundancy-aware top-k patterns is illustrated through two real applications: disk block prefetch and document theme extraction. Our method can also be applied to processing redundancy-aware top-k queries in traditional databases.
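The greedy idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything concrete here is an assumption: the marginal significance of a candidate is taken to be its own significance minus its largest redundancy with any already-selected pattern, and the function names, the toy scores, and the Jaccard-based redundancy measure are all hypothetical, chosen only to make the example runnable.

```python
from typing import Callable, List, Sequence


def greedy_redundancy_aware_top_k(
    patterns: Sequence[str],
    significance: Callable[[str], float],
    redundancy: Callable[[str, str], float],
    k: int,
) -> List[str]:
    """Pick up to k patterns greedily by marginal significance.

    Assumption (not taken from the abstract): marginal significance is
    significance(p) minus the maximum redundancy between p and any pattern
    that has already been selected.
    """
    if k <= 0 or not patterns:
        return []
    remaining = list(patterns)
    # Seed the result with the single most significant pattern.
    first = max(remaining, key=significance)
    selected = [first]
    remaining.remove(first)
    while remaining and len(selected) < k:
        def marginal(p: str) -> float:
            return significance(p) - max(redundancy(p, q) for q in selected)
        best = max(remaining, key=marginal)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): each pattern is a string of item symbols.
scores = {"ab": 0.9, "abc": 0.85, "xy": 0.6}


def toy_significance(p: str) -> float:
    return scores[p]


def toy_redundancy(p: str, q: str) -> float:
    # Hypothetical measure: Jaccard overlap of the item sets, scaled by the
    # smaller significance so redundancy never exceeds either score.
    a, b = set(p), set(q)
    return len(a & b) / len(a | b) * min(scores[p], scores[q])


result = greedy_redundancy_aware_top_k(
    list(scores), toy_significance, toy_redundancy, k=2
)
print(result)
```

On this toy input, a plain top-2 ranking by significance would return the two overlapping patterns "ab" and "abc", whereas the sketch prefers "xy" as the second pick because it shares no items with "ab". The sketch makes no claim about the O(log k) bound, which the abstract states only under conditions on redundancy.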

Index Terms

Extracting redundancy-aware top-k patterns
1. Information systems
  1. Information systems applications
    1. Data mining

Published In

KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining

August 2006

986 pages

ISBN:1595933395

DOI:10.1145/1150402

Conference Chair: Tina Eliassi-Rad (LLNL)

General Chair: Lyle Ungar (University of Pennsylvania)

Program Chairs: Mark Craven (University of Wisconsin), Dimitrios Gunopulos (University of California, Riverside)

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

KDD06

KDD06: The 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 20 - 23, 2006

Philadelphia, PA, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Cited By

Lehembre E., Cremilleux B., Cuissart B., Ouali A., Zimmermann A. (2025). Wave Top-k Random-d Family Search: How to Guide an Expert in a Structured Pattern Space. Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pages 104-119. Online publication date: 1-Jan-2025.
https://doi.org/10.1007/978-3-031-74633-8_7

Fan W., Han Z., Xie M., Zhang G. (2024). Discovering Top-k Relevant and Diversified Rules. Proceedings of the ACM on Management of Data, 2(4):1-28. Online publication date: 30-Sep-2024.
https://doi.org/10.1145/3677131

Kumpulainen I., Adriaens F., Tatti N., Baeza-Yates R., Bonchi F. (2024). Max-Min Diversification with Asymmetric Distances. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1440-1450. Online publication date: 25-Aug-2024.
https://dl.acm.org/doi/10.1145/3637528.3671757

Trasierras A., Luna J., Fournier-Viger P., Ventura S. (2024). Data heterogeneity's impact on the performance of frequent itemset mining algorithms. Information Sciences: an International Journal, 678(C). Online publication date: 1-Sep-2024.
https://dl.acm.org/doi/10.1016/j.ins.2024.120981

Zhang H., Xu D., Gai L., Zhang Z. (2024). Online learning under one sided σ-smooth function. Journal of Combinatorial Optimization, 47(5). Online publication date: 18-May-2024.
https://doi.org/10.1007/s10878-024-01174-2

Lehembre E., Cremilleux B., Zimmermann A., Cuissart B., Ouali A. (2024). WaveLSea: helping experts interactively explore pattern mining search spaces. Data Mining and Knowledge Discovery, 38(4):2403-2439. Online publication date: 26-May-2024.
https://doi.org/10.1007/s10618-024-01037-8

Huang C., Zhang Q., Guo D., Zhao X., Wang X. (2023). Discovering Association Rules with Graph Patterns in Temporal Networks. Tsinghua Science and Technology, 28(2):344-359. Online publication date: Apr-2023.
https://doi.org/10.26599/TST.2021.9010090

Zhang H., Xu D., Zhang Y., Zhang Z. (2023). Parallelized maximization of nonmonotone one-sided σ-smooth function. Computers and Electrical Engineering, 105, article 108478. Online publication date: Jan-2023.
https://doi.org/10.1016/j.compeleceng.2022.108478

A Geetha, Ravindra Changala, Goda Gangaram, Dr. Mahesh Kotha (2022). MapReduce Framework to Improve the Efficiency of Large Scale Item Sets in IoT Using Parallel Mining of Representative Patterns in Big Data. International Journal of Scientific Research in Science and Technology, pages 151-161. Online publication date: 8-Nov-2022.
https://doi.org/10.32628/IJSRST229618

Wang L., Fang Y., Zhou L. (2022). Non-redundant Prevalent Co-location Patterns. Preference-based Spatial Co-location Pattern Mining, pages 137-166. Online publication date: 4-Jan-2022.
https://doi.org/10.1007/978-981-16-7566-9_6
