A survey on algorithms for mining frequent itemsets over data streams

  • Regular Paper
  • Knowledge and Information Systems

Abstract

The increasing prominence of data streams arising in a wide range of advanced applications, such as fraud detection and trend learning, has led to the study of online mining of frequent itemsets (FIs). Unlike mining static databases, mining data streams poses many new challenges. In addition to the one-scan nature, the unbounded memory requirement and the high arrival rate of data streams, the combinatorial explosion of itemsets further complicates the mining task. The high complexity of the FI mining problem hinders the application of stream mining techniques. We recognize that a critical review of existing techniques is needed in order to design and develop efficient mining algorithms and data structures that can match the processing rate of the mining to the high arrival rate of data streams. Within a unifying set of notations and terminologies, we describe in this paper the efforts and main techniques for mining data streams and present a comprehensive survey of a number of state-of-the-art algorithms for mining frequent itemsets over data streams. We classify the stream-mining techniques into two categories based on the window model that they adopt, in order to provide insights into how and why the techniques are useful. We then further analyze the algorithms according to whether they are exact or approximate and, for approximate approaches, whether they are false-positive or false-negative. We also discuss various interesting issues, including the merits and limitations of existing research and substantive areas for future research.

Author information

Corresponding author

Correspondence to James Cheng.

About this article

Cite this article

Cheng, J., Ke, Y. & Ng, W. A survey on algorithms for mining frequent itemsets over data streams. Knowl Inf Syst 16, 1–27 (2008). https://doi.org/10.1007/s10115-007-0092-4

  • DOI: https://doi.org/10.1007/s10115-007-0092-4
