Abstract
Frequent itemset mining over data streams has become a hot topic in data mining and knowledge discovery in recent years and has been applied in many areas. However, setting a minimum support threshold requires domain knowledge, and a poorly chosen threshold places a considerable burden on users. Finding the top-K frequent itemsets over data streams is therefore of interest to users. In this paper, a dynamic, incremental, approximate algorithm, TOPSIL-Miner, is presented to mine the top-K significant itemsets in landmark windows. A new data structure, TOPSIL-Tree, is designed to store the potentially significant itemsets, and four further data structures (the maximum support list, the ordered item list, TOPSET, and the minimum support list) are devised to maintain information about the mining results. Moreover, three optimization strategies are exploited to reduce the time and space cost of the algorithm: (1) pruning trivial nodes in the current data stream, (2) raising the mining support threshold adaptively and heuristically during the mining process, and (3) raising the pruning threshold dynamically. The accuracy of the algorithm is also analyzed. Extensive experiments are performed to evaluate the effectiveness, efficiency, and precision of the algorithm.
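To make the problem setting concrete, the following is a minimal sketch of top-K itemset counting over a landmark window, that is, over every transaction seen since a fixed starting point. It is not TOPSIL-Miner and does not use the TOPSIL-Tree: it is an exact, brute-force baseline with no pruning, and the class name `NaiveTopKItemsets`, the `max_size` cap, and the method names are all assumptions introduced here for illustration. It only shows how the support of the K-th ranked itemset can act as a threshold that rises as the stream grows, so that the user supplies K rather than a support value.

```python
from collections import Counter
from itertools import combinations
import heapq


class NaiveTopKItemsets:
    """Exact top-K itemset counts over a landmark window.

    Hypothetical baseline for illustration only: it enumerates every
    itemset up to max_size in each transaction and never prunes.
    """

    def __init__(self, k, max_size=3):
        self.k = k
        self.max_size = max_size      # assumed cap to bound enumeration
        self.counts = Counter()       # itemset (sorted tuple) -> support count
        self.n_transactions = 0       # transactions seen since the landmark

    def add(self, transaction):
        """Update the counts with one incoming transaction."""
        items = sorted(set(transaction))
        self.n_transactions += 1
        for size in range(1, min(self.max_size, len(items)) + 1):
            for itemset in combinations(items, size):
                self.counts[itemset] += 1

    def top_k(self):
        """Return the K itemsets with the highest support.

        Ties are broken by the itemset tuple so the result is deterministic.
        """
        return heapq.nsmallest(
            self.k, self.counts.items(), key=lambda kv: (-kv[1], kv[0])
        )

    def implied_min_support(self):
        """Support count of the K-th ranked itemset (0 if fewer than K exist).

        This value plays the role of a support threshold derived from K.
        """
        top = self.top_k()
        return top[-1][1] if len(top) == self.k else 0


if __name__ == "__main__":
    miner = NaiveTopKItemsets(k=3)
    for t in [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b"], ["a", "b", "c"]]:
        miner.add(t)
    print(miner.top_k())
    print(miner.implied_min_support())
```

Because this baseline keeps a counter for every itemset it has ever seen, its memory grows without bound on a real stream. Bounding that cost is what the pruning and threshold-raising strategies listed in the abstract are intended to do.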
Cite this article
Yang, B., Huang, H. TOPSIL-Miner: an efficient algorithm for mining top-K significant itemsets over data streams. Knowl Inf Syst 23, 225–242 (2010). https://doi.org/10.1007/s10115-009-0211-5
DOI: https://doi.org/10.1007/s10115-009-0211-5