
Mining top-K frequent itemsets from data streams

Data Mining and Knowledge Discovery

Abstract

Frequent pattern mining on data streams has attracted much interest recently. However, it is not easy for users to determine a proper frequency threshold; it is more reasonable to ask users to set a bound on the result size instead. We therefore study the problem of mining the top-K frequent itemsets in data streams. We introduce a method based on the Chernoff bound that comes with a guarantee on the output quality as well as a bound on the memory usage, and we also propose an algorithm based on the Lossy Counting Algorithm. In most of our experiments the two proposed algorithms obtain perfect solutions, and the memory space they occupy is very small. We further adapt both algorithms to handle the case where mining is restricted to a sliding window. The experiments show that the results are accurate.
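
The Chernoff-bound guarantee mentioned above is not spelled out in this preview. Purely as an illustration of the kind of deviation bound involved, the following sketch computes a two-sided Hoeffding/Chernoff-style error for a frequency estimated from n transactions; the function name and the exact constants are assumptions, not the paper's formula.

```python
import math

def frequency_error_bound(n: int, delta: float) -> float:
    """Two-sided Hoeffding/Chernoff-style bound: with probability at least
    1 - delta, a frequency estimated from n transactions deviates from the
    true frequency by less than the returned epsilon.  Illustrative only."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

# Example: after 100,000 transactions with delta = 0.05, the estimate is
# within roughly 0.0043 of the true frequency.
print(frequency_error_bound(100_000, 0.05))
```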


Notes

  1. To obtain the K-th frequent itemset, we assume all itemsets are sorted in descending order of frequency; for itemsets with the same frequency, the order is arbitrary. The K-th itemset in such a sorted list is a K-th frequent itemset (this selection, together with the points in Notes 2, 3 and 5, is sketched in code after these notes).

  2. Note that there may be more than K itemsets in the output when more than one itemset has frequency \(f_l\).

  3. In our implementation, we store the frequency/count of all itemsets, instead of the support (as a fraction). So, in this step, we just need to increment the count of \(e\) in \(F_l\) by \(c\). Similar arguments apply to the other updates of itemset frequencies.

  4. Note that in Yu et al. (2004), where Problem A is tackled, there is a bound on the memory space for single-item mining but not for the case of itemsets with multiple items.

  5. Note that in Line 12 of the algorithm, we could use f + Δ instead of f, since Δ bounds the error in f and the greatest possible frequency is f + Δ. This would ensure no false dismissals but would generate more false positives. We therefore use f instead (both checks are illustrated in the sketch after these notes).

  6. As Metwally et al. (2005) considers only the mining of frequent items, we adapt its algorithm for mining frequent itemsets. The modification is straightforward and similar to the one in Manku and Motwani (2002) (an item-level sketch of the Metwally et al. algorithm is given after these notes).

  7. The meaning of \(\overline{\epsilon}\) in the algorithm is different from that of the error parameter ∊ in the Lossy Counting Algorithm.

  8. Yu et al. (2004) adopted the default setting δ = 0.1, which means that the probability that the real frequent itemsets are missed is at most 0.1. In this paper, we use a smaller default value δ = 0.05 because Theorem 2 shows that the probability that the top K frequent itemsets are missed is at most 2 × 0.05 − 0.05² = 0.0975 ≈ 0.1 (the arithmetic is written out after these notes).

  9. In Manku and Motwani (2002), ∊ is set to 0.1 × s, where s is the user support threshold in Problem A. In this paper, as the problem of mining top-K frequent itemsets provides no support threshold, we adopt ∊ = 0.001, which was also used in Manku and Motwani (2002) when the support threshold was set to 0.01.

  10. We randomly re-order the data in units of segments of 1,000 items each.
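
Notes 1, 2, 3 and 5 describe small mechanics that are easy to mis-read in prose. The following is a minimal Python sketch of those mechanics only; the identifiers (top_k_with_ties, passes, delta_err, threshold) are illustrative and do not come from the paper's pseudocode.

```python
from collections import Counter

def top_k_with_ties(counts: Counter, k: int):
    """Select the K-th count from a descending ranking (ties broken
    arbitrarily, Note 1) and return every itemset reaching that count,
    so the output may hold more than K itemsets (Note 2)."""
    if k <= 0 or not counts:
        return []
    ranked = counts.most_common()                 # sorted by count, descending
    kth_count = ranked[min(k, len(ranked)) - 1][1]
    return [(iset, c) for iset, c in ranked if c >= kth_count]

# Counts are kept as absolute frequencies rather than fractional supports
# (Note 3), so folding in new occurrences is a plain increment:
counts = Counter()
counts[frozenset({"a", "b"})] += 3                # itemset e observed c = 3 more times

# Note 5: with an estimated count f whose error is bounded by delta_err,
# testing f + delta_err >= threshold avoids false dismissals but admits
# more false positives; the paper's algorithms test f instead.
def passes(f: int, delta_err: int, threshold: int, optimistic: bool = False) -> bool:
    return (f + delta_err if optimistic else f) >= threshold

print(top_k_with_ties(counts, 1))                 # [(frozenset({'a', 'b'}), 3)]
```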
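
Note 6 adapts the algorithm of Metwally et al. (2005), commonly known as Space-Saving, from single items to itemsets; the itemset adaptation itself is not reproduced here. A minimal sketch of the item-level counter, for reference only:

```python
def space_saving(stream, m):
    """Space-Saving: keep at most m counters; an unmonitored item evicts the
    current minimum counter and inherits its count as over-estimation error."""
    counts, errors = {}, {}
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif len(counts) < m:
            counts[item] = 1
            errors[item] = 0
        else:
            victim = min(counts, key=counts.get)   # smallest monitored count
            floor = counts.pop(victim)
            errors.pop(victim)
            counts[item] = floor + 1               # true count over-estimated by at most floor
            errors[item] = floor
    return counts, errors

# Example: the two heavy items stay monitored among the m = 4 counters.
cnt, err = space_saving("ababababcadaeafa", m=4)
```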
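
For reference, the arithmetic in Note 8 written out; the rewriting of 2δ − δ² as 1 − (1 − δ)² is a purely algebraic identity (whether Theorem 2 is stated in that form is not shown in this preview):

\[
2\delta - \delta^{2} \;=\; 1 - (1 - \delta)^{2},
\qquad
\delta = 0.05:\;\; 2(0.05) - (0.05)^{2} = 0.1 - 0.0025 = 0.0975 \approx 0.1 .
\]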

References

  • Agrawal, R. IBM Synthetic Data Generator, http://www.almaden.ibm.com/cs/quest/syndata.html.

  • Babcock, B., Datar, M., Motwani, R., and O'Callaghan, L. 2003. Maintaining variance and k-medians over data stream windows. In SIGMOD.

  • Babcock, B., and Olston, C. 2003. Distributed top-K monitoring. In SIGMOD.

  • Lee, C.-H., Lin, C.-R., and Chen, M.-S. 2001. Sliding-window filtering: An efficient algorithm for incremental mining. In Intl. Conf. on Information and Knowledge Management.

  • Chang, J.H. and Lee, W.S. 2003. Finding recent frequent itemsets adaptively over online data streams. In SIGKDD.

  • Charikar, M., Chen, K., and Farach-Colton, M. 2002. Finding frequent items in data streams. In 29th Intl. Colloquium on Automata, Languages and Programming.

  • Cheung, Y.-L., and Fu, A.W.-C. 2002. An FP-tree approach for mining n-most interesting itemsets. In SPIE Conference on Data Mining.

  • Cheung, Y.-L., and Fu, A.W.-C. 2004. Mining frequent itemsets without support threshold: with and without item constraints. In IEEE Trans. on Knowledge and Data Engineering.

  • Datar, M., Gionis, A., Indyk, P., and Motwani, R. 2002. Maintaining stream statistics over sliding windows. In SIAM Journal on Computing.

  • Demaine, E., Lopez-Ortiz, A., and Munro, J. 2002. Frequency estimation of internet packet streams with limited space. In Proc. of 10th Annual European Symposium on Algorithms.

  • Fu, A.W.-C., Kwong, F.W.-W., and Tang, J. 2000. Mining N-most interesting itemsets. In ISMIS.

  • Giannella, C., Han, J., Pei, J., Yan, X., and Yu, P. 2003. Mining frequent patterns in data streams at multiple time granularities. In Next Generation Data Mining.

  • Gibbons, P.B. and Matias, Y. 1998. New sampling-based summary statistics for improving approximate query answers. In SIGMOD.

  • Golab, L. and Ozsu, M.T. 2003. Processing sliding window multi-joins in continuous queries over data streams. In VLDB.

  • Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD.

  • Han, J., Wang, J., Lu, Y., and Tzvetkov, P. 2002. Mining Top-K frequent closed patterns without minimum support. In ICDM.

  • Hidber, C. 1999. Online association rule mining. In SIGMOD.

  • Kohavi, R., Brodley, C., Frasca, B., Mason, L., and Zheng, Z. 2000. KDD-Cup 2000 Organizers' Report: Peeling the Onion. In SIGKDD Explorations 2(2).

  • Manku, G.S., and Motwani, R. 2002. Approximate frequency counts over data streams. In VLDB.

  • Metwally, A., Agrawal, D., and Abbadi, A.E. 2005. Efficient computation of frequent and top-k elements in data streams. In ICDT.

  • Minnesota Population Center, Univ. of Minnesota. IPUMS-98. http://www.ipums.umn.edu/usa/samples.html.

  • Teng, W.-G., Chen, M.-S., and Yu, P.S. 2003. A regression-based temporal pattern mining scheme for data streams. In VLDB.

  • Vitter, J.S. 1985. Random sampling with a reservoir. In ACM Transactions on Mathematical Software (TOMS), 11(1).

  • Wong, R.C.-W. and Fu, A.W.-C. 2005a. Mining top-K frequent patterns from data streams: A study. Technical report, Computer Science and Engineering Department, Chinese University of Hong Kong.

  • Wong, R.C.-W. and Fu, A.W.-C. 2005b. Mining top-K itemsets over a sliding window based on Zipfian distribution. In SIAM International Conference on Data Mining.

  • Xu, J., Lin, X., and Zhou, X. 2004. Space efficient quantile summary for constrained sliding windows on a data stream. In The 5th International Conference on Web-Age Information Management.

  • Yu, J., Chong, Z., Lu, H., and Zhou, A. 2004. False positive or false negative: Mining frequent itemsets from high speed transactional data streams. In VLDB.


Acknowledgments

We thank Y.L. Cheung for providing us with the code of BOMO. This research was supported by the RGC Earmarked Research Grant of HKSAR CUHK 4179/01E, and the Innovation and Technology Fund (ITF) in the HKSAR [ITS/069/03].

Author information

Correspondence to Raymond Chi-Wing Wong.

About this article

Cite this article

Wong, R.C.-W., Fu, A.W.-C. Mining top-K frequent itemsets from data streams. Data Min Knowl Disc 13, 193–217 (2006). https://doi.org/10.1007/s10618-006-0042-x

