Frequent Pattern Mining in Data Streams

Jin, Ruoming; Agrawal, Gagan

doi:10.1007/978-0-387-47534-9_4


Part of the book series: Advances in Database Systems (ADBS, volume 31)

Abstract

Frequent pattern mining is a core data mining operation and has been extensively studied over the last decade. Recently, mining frequent patterns over data streams has attracted a lot of research interest. Compared with other streaming queries, frequent pattern mining poses great challenges because of its high memory and computational costs and the accuracy requirements on the mining results.

In this chapter, we survey state-of-the-art techniques for mining frequent patterns over data streams. We also introduce a new approach to this problem, which makes two major contributions. First, it provides a one-pass algorithm for frequent itemset mining that has deterministic bounds on its accuracy and does not require any out-of-core summary structure. Second, because the one-pass algorithm does not produce any false negatives, it can easily be extended to an accurate two-pass algorithm, which is very memory efficient.
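The relationship between a one-pass result without false negatives and an exact two-pass result can be illustrated on a simpler problem. The sketch below is not the chapter's itemset algorithm; it uses the well-known counter-based technique for frequent single items (often called Misra-Gries) as a stand-in. The function names, the `theta` parameter, and the assumption that the stream is a list that can be read twice are all illustrative choices, not taken from the chapter.

```python
from collections import Counter
from math import ceil


def one_pass_candidates(stream, theta):
    """Return a superset of the items whose count exceeds theta * n.

    Uses at most ceil(1/theta) - 1 in-memory counters. The result may
    contain false positives but never misses a truly frequent item.
    """
    k = ceil(1 / theta)
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)


def two_pass_frequent(stream, theta):
    """Exact frequent items: count only the one-pass candidates again.

    Assumes `stream` is a sequence that can be traversed a second time.
    """
    candidates = one_pass_candidates(stream, theta)
    exact = Counter(x for x in stream if x in candidates)
    n = len(stream)
    return {x: c for x, c in exact.items() if c > theta * n}


if __name__ == "__main__":
    data = ["a", "b", "c", "a", "d", "a", "b", "e", "a", "f", "b", "a"]
    print(one_pass_candidates(data, 0.25))
    print(two_pass_frequent(data, 0.25))
```

Because the first pass never drops a truly frequent item, the second pass only needs exact counters for the small candidate set, which is what keeps the two-pass variant memory efficient in this simplified setting.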



Author information

Authors and Affiliations

Computer Science Department, Kent State University, USA
Ruoming Jin
Department of Computer Science and Engineering, The Ohio State University, USA
Gagan Agrawal


Editor information

Editors and Affiliations

IBM, Thomas J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY, 10532
Charu C. Aggarwal


About this chapter

Cite this chapter

Jin, R., Agrawal, G. (2007). Frequent Pattern Mining in Data Streams. In: Aggarwal, C.C. (eds) Data Streams. Advances in Database Systems, vol 31. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-47534-9_4

DOI: https://doi.org/10.1007/978-0-387-47534-9_4
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-28759-1
Online ISBN: 978-0-387-47534-9
eBook Packages: Computer Science (R0)
