Mining frequent itemsets over distributed data streams by continuously maintaining a global synopsis

Data Mining and Knowledge Discovery

Abstract

Mining frequent itemsets over data streams has attracted much research attention in recent years. In earlier work, we developed a hash-based approach for mining frequent itemsets over a single data stream. In this paper, we extend that approach to mine global frequent itemsets from a collection of data streams distributed across distinct remote sites. To speed up the mining process, we make the first attempt to address the new problem of continuously maintaining a global synopsis for the union of all the distributed streams. Mining results can therefore be produced on demand by directly processing the maintained global synopsis. Instead of collecting and processing all the data at a central server, which may waste the computational resources of the remote sites, computation over the data streams is distributed. We propose a distributed computation framework comprising two communication strategies and one merging operation. The communication strategies are designed around an accuracy guarantee on the mining results and determine when and what the remote sites should transmit to the central server (called the coordinator). The merging operation, in turn, merges the information received from the remote sites into the global synopsis maintained at the coordinator. Together, the strategies and the merging operation achieve the goal of continuously maintaining the global synopsis. Based on the continuously maintained global synopsis, we propose a mining algorithm for finding global frequent itemsets. We also provide correctness guarantees for the communication strategies and the merging operation, together with an accuracy analysis of the mining algorithm. Finally, a series of experiments on synthetic datasets and a real dataset demonstrates the effectiveness and efficiency of the distributed computation framework.
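
The coordinator-and-sites arrangement described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the paper's synopsis is hash-based and its two communication strategies are not specified here, so this sketch substitutes exact counters and a simple threshold-triggered delta. The names `RemoteSite`, `Coordinator`, `max_size`, and `slack` are all hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


class RemoteSite:
    """Hypothetical remote site: counts itemsets locally and ships deltas."""

    def __init__(self, max_size=2, slack=2):
        self.max_size = max_size       # largest itemset size tracked (assumption)
        self.slack = slack             # unsent count tolerated per itemset (assumption)
        self.unsent = Counter()        # counts not yet reported to the coordinator
        self.unsent_transactions = 0   # transactions not yet reported

    def observe(self, transaction):
        """Count the itemsets of one transaction; return a delta or None."""
        items = sorted(set(transaction))
        self.unsent_transactions += 1
        for k in range(1, self.max_size + 1):
            for itemset in combinations(items, k):
                self.unsent[itemset] += 1
        # Assumed strategy: transmit once any unsent count exceeds the slack.
        if max(self.unsent.values(), default=0) > self.slack:
            return self.flush()
        return None

    def flush(self):
        """Hand over all unsent counts and reset the local buffer."""
        delta = (self.unsent, self.unsent_transactions)
        self.unsent, self.unsent_transactions = Counter(), 0
        return delta


class Coordinator:
    """Hypothetical coordinator holding the merged (global) counts."""

    def __init__(self):
        self.synopsis = Counter()
        self.transactions = 0

    def merge(self, delta):
        """Fold one site's delta into the global synopsis."""
        counts, n = delta
        self.synopsis.update(counts)
        self.transactions += n

    def frequent(self, min_support):
        """Answer on demand from the synopsis, without contacting the sites."""
        threshold = min_support * self.transactions
        return {s: c for s, c in self.synopsis.items() if c >= threshold}


# Two toy streams, one per site.
sites = [RemoteSite(), RemoteSite()]
coordinator = Coordinator()
streams = [
    [["a", "b"], ["a", "b", "c"], ["a"]],
    [["a", "b"], ["b", "c"], ["a", "b"]],
]
for site, stream in zip(sites, streams):
    for transaction in stream:
        delta = site.observe(transaction)
        if delta is not None:
            coordinator.merge(delta)

print(coordinator.frequent(0.5))
```

In this sketch, each site holds back at most `slack` occurrences of any itemset, so the coordinator's count for an itemset can lag the true global count by at most `slack` times the number of sites. That bound is a property of this toy design only; the paper's own accuracy guarantee should be taken from the full text.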

References

  • Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Bocca JB, Jarke M, Zaniolo C (eds) Proceedings of the 20th international conference on very large databases (VLDB 1994). Santiago, Chile, pp 487–499

  • Arlitt M, Williamson C (1996) Web server workload characterization: the search for invariants. In: Proceedings performance evaluation review, vol 24, pp 126–137

  • Babcock B, Olston C (2003) Distributed top-K monitoring. In: Halevy AY, Ives ZG, Doan AH (eds) Proceedings of the 2003 ACM SIGMOD international conference on management of data (SIGMOD 2003). San Diego, California, USA, pp 28–39

  • Charikar M, Chen K, Farach-Colton M (2002) Finding frequent items in data streams. In: Widmayer P, Ruis FT, Bueno RM, Hennessy M, Eidenbenz S, Conejo R (eds) Proceedings of the 29th international colloquium on automata, languages and programming (ICALP’02). Málaga, Spain, pp 693–703

  • Calders T, Dexters N, Goethals B (2007) Mining frequent itemsets in a stream. In: Proceedings of the seventh IEEE international conference on data mining (ICDM’07). Omaha, USA, pp 83–92

  • Cormode G, Garofalakis M (2005) Sketching streams through the net: distributed approximate query tracking. In: Böhm K, Jensen CS, Haas LM, Kersten ML, Larson PÅ, Ooi BC (eds) Proceedings of the 31st international conference on very large data bases (VLDB 2005). Trondheim, Norway, pp 13–24

  • Cormode G, Garofalakis M, Muthukrishnan S, Rastogi R (2005) Holistic aggregates in a networked world: distributed tracking of approximate quantiles. In: Özcan F (ed) Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD 2005). Baltimore, Maryland, USA, pp 25–36

  • Cormode G, Hadjieleftheriou M (2010) Methods for finding frequent items in data streams. VLDB J 19(1): 3–20

  • Cheng J, Ke Y, Ng W (2006) Maintaining frequent itemsets over high-speed data streams. In: Ng WK, Kitsuregawa M, Li J, Chang K (eds) Proceedings of the 10th Pacific-Asia conference on knowledge discovery and data mining (PAKDD 2006). Singapore, pp 462–467

  • Chang JH, Lee WS (2003) Finding recent frequent itemsets adaptively over online data streams. In: Getoor L, Senator TE, Domingos P, Faloutsos C (eds) Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery in databases and data mining (KDD 2003). Washington, DC, USA, pp 487–492

  • Cormode G, Muthukrishnan S (2005) An improved data stream summary: the count-min sketch and its applications. J Algorithm 55(1): 58–75

  • Cormode G, Muthukrishnan S, Zhuang W (2006) What’s different: distributed, continuous monitoring of duplicate-resilient aggregates on data streams. In: Liu L, Reuter A, Whang KY, Zhang J (eds) Proceedings of the 22nd international conference on data engineering (ICDE’06). Atlanta, GA, USA, pp 57–57

  • Cormode G, Muthukrishnan S, Zhuang W (2007) Conquering the divide: continuous clustering of distributed data streams. In: Proceedings of the 23rd international conference on data engineering (ICDE’07), Istanbul, Turkey, pp 1036–1045

  • Das A, Ganguly S, Garofalakis M, Rastogi R (2004) Distributed set-expression cardinality estimation. In: Nascimento MA, Özsu MT, Kossmann D, Miller RJ, Blakeley JA, Schiefer KB (eds) Proceedings of the thirtieth international conference on very large data bases (VLDB 2004), Toronto, Canada, pp 312–323

  • Demaine E, Lopez-Ortiz A, Munro JI (2002) Frequency estimation of Internet packet streams with limited space. In: Möhring RH, Raman R (eds) Proceedings of the 10th European symposium on algorithms (ESA 2002), Rome, Italy, pp 348–360

  • Dang XH, Ng WK, Ong KL (2008) Online mining of frequent sets in data streams with error guarantee. Knowl Inf Syst 16(2): 245–258

  • Fuller R, Kantardzic M (2008) Distributed monitoring of frequent items. Trans MLDM 1(2): 67–82

  • Fischer MJ, Salzberg SL (1982) Finding a majority among N votes: solution to problem 81-5. J Algorithm 3(4): 362–380

  • Giannella C, Han J, Pei J, Yan X, Yu PS (2004) Mining frequent patterns in data streams at multiple time granularities. In: Kargupta H, Joshi A, Sivakumar K, Yesha Y (eds) Data mining next generation challenges and future directions. AAAI Press, Menlo Park, pp 191–212

  • Han J, Pei J, Yin Y, Mao R (2004) Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Min Knowl Disc 8(1): 53–87

  • Jin R, Agrawal G (2005) An algorithm for in-core frequent itemset mining on streaming data. In: Proceedings of the fifth IEEE international conference on data mining (ICDM’05), Houston, Texas, USA, pp 210–217

  • Jin C, Qian W, Sha C, Yu JX, Zhou A (2003) Dynamically maintaining frequent items over a data stream. In: Proceedings of the 12th ACM international conference on information and knowledge management (CIKM’03), New Orleans, LA, USA, pp 287–294

  • Keralapura R, Cormode G, Ramamirtham J (2006) Communication-efficient distributed monitoring of thresholded counts. In: Chaudhuri S, Hristidis V, Polyzotis B (eds) Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD 2006). Chicago, Illinois, USA, pp 289–300

  • Karp RM, Papadimitriou CH, Shenker S (2003) A simple algorithm for finding frequent elements in streams and bags. ACM Trans Database Syst 28(1): 51–55

  • Kashyap S, Ramamirtham J, Rastogi R, Shukla P (2008) Efficient constraint monitoring using adaptive thresholds. In: Proceedings of IEEE 24th international conference on data engineering (ICDE’08), Cancún, México, pp 526–535

  • Lin CH, Chiu DY, Wu YH, Chen ALP (2005) Mining frequent itemsets from data streams with a time-sensitive sliding window. 2005 SIAM international conference on data mining (SDM’05), Newport Beach, CA

  • Leung CKS, Khan Q (2006) DSTree: A tree structure for the mining of frequent sets from data streams. In: Proceedings of the sixth IEEE international conference on data mining (ICDM’06), Hong Kong, China, pp 928–932

  • Li HF, Lee SY, Shan MK (2004) An efficient algorithm for mining frequent itemsets over the entire history of data streams. The first international workshop on knowledge discovery in data streams, in conjunction with ECML/PKDD 2004, Pisa, Italy

  • Metwally A, Agrawal D, Abbadi AE (2005) Efficient computation of frequent and top-k elements in data streams. In: Eiter T, Libkin L (eds) Proceedings of the 10th international conference on database theory (ICDT2005). Edinburgh, UK, pp 398–412

  • Manku GS, Motwani R (2002) Approximate frequency counts over data streams. In: Proceedings of the 28th international conference on very large databases (VLDB 2002), Hong Kong, China, pp 346–357

  • Manjhi A, Shkapenyuk V, Dhamdhere K, Olston C (2005) Finding (recently) frequent items in distributed data streams. In: Proceedings of IEEE 21st international conference on data engineering (ICDE’05), Tokyo, Japan, pp 767–778

  • Mozafari B, Thakkar H, Zaniolo C (2008) Verifying and mining frequent patterns from large windows over data streams. In: Proceedings of IEEE 24th international conference on data engineering (ICDE’08), Cancún, México, pp 179–188

  • Savasere A, Omiecinski E, Navathe S (1995) An efficient algorithm for mining association rules in large database. In: Dayal U, Gray PMD, Nishio S (eds) Proceedings of the 21st international conference on very large databases (VLDB 1995). Zurich, Switzerland, pp 432–444

  • Wang ET, Chen ALP (2009) A novel Hash-based approach for mining frequent itemsets over data streams requiring less memory space. Data Min Knowl Disc 19(1): 132–172

  • Yu JX, Chong Z, Lu H, Zhou A (2004) False positive or false negative: mining frequent itemsets from high speed transactional data streams. In: Nascimento MA, Özsu MT, Kossmann D, Miller RJ, Blakeley JA, Schiefer KB (eds) Proceedings of the 30th international conference on very large databases (VLDB 2004). Toronto, Canada, pp 204–215

Author information

Corresponding author

Correspondence to Arbee L. P. Chen.

Additional information

Responsible editor: M.J. Zaki.

About this article

Cite this article

Wang, E.T., Chen, A.L.P. Mining frequent itemsets over distributed data streams by continuously maintaining a global synopsis. Data Min Knowl Disc 23, 252–299 (2011). https://doi.org/10.1007/s10618-010-0204-8

  • DOI: https://doi.org/10.1007/s10618-010-0204-8
