Mining frequent itemsets over distributed data streams by continuously maintaining a global synopsis

Wang, En Tzu; Chen, Arbee L. P.

doi:10.1007/s10618-010-0204-8

Published: 17 November 2010

Volume 23, pages 252–299, (2011)

Data Mining and Knowledge Discovery

Abstract

Mining frequent itemsets over data streams has attracted much research attention in recent years. We previously developed a hash-based approach for mining frequent itemsets over a single data stream. In this paper, we extend that approach to mine global frequent itemsets from a collection of data streams distributed across distinct remote sites. To speed up the mining process, we make the first attempt to address a new problem: continuously maintaining a global synopsis for the union of all the distributed streams. The mining results can therefore be produced on demand by directly processing the maintained global synopsis. Instead of collecting and processing all the data at a central server, which may waste the computation resources of the remote sites, computations over the data streams are performed in a distributed manner. We propose a distributed computation framework consisting of two communication strategies and one merging operation. The communication strategies are designed according to an accuracy guarantee on the mining results and determine when and what the remote sites should transmit to the central server (called the coordinator). The merging operation merges the information received from the remote sites into the global synopsis maintained at the coordinator. Together, the strategies and the operation achieve the goal of continuously maintaining the global synopsis. Based on the continuously maintained global synopsis, we propose a mining algorithm for finding global frequent itemsets. We also provide correctness guarantees for the communication strategies and the merging operation, and an accuracy guarantee analysis of the mining algorithm. Finally, a series of experiments on synthetic datasets and a real dataset show the effectiveness and efficiency of the distributed computation framework.
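The abstract does not spell out the two communication strategies, the merging operation, or the hash-based synopsis, so the following is only a rough sketch of the general shape of such a framework, not the paper's method. It assumes a deliberately simple rule: each remote site buffers exact itemset counts and reports them once a fixed number of transactions has accumulated, and the coordinator merges reports by adding counts. The names `RemoteSite`, `Coordinator`, `batch_size`, and `max_len` are hypothetical.

```python
from collections import Counter
from itertools import combinations


class RemoteSite:
    """Buffers local itemset counts and reports them in batches (assumed rule)."""

    def __init__(self, batch_size, max_len=2):
        self.batch_size = batch_size  # transactions buffered before a report
        self.max_len = max_len        # longest itemset that is counted
        self.pending = Counter()      # counts not yet sent to the coordinator
        self.pending_tx = 0           # transactions not yet reported

    def observe(self, transaction):
        """Count one transaction; return (delta, n) once a batch is full."""
        items = sorted(set(transaction))
        for size in range(1, self.max_len + 1):
            for itemset in combinations(items, size):
                self.pending[itemset] += 1
        self.pending_tx += 1
        if self.pending_tx < self.batch_size:
            return None  # "when to transmit": not yet
        report = (self.pending, self.pending_tx)  # "what to transmit"
        self.pending, self.pending_tx = Counter(), 0
        return report


class Coordinator:
    """Holds the global synopsis: summed counts over all received reports."""

    def __init__(self):
        self.counts = Counter()
        self.total_tx = 0

    def merge(self, report):
        """Fold one site's report into the global synopsis."""
        delta, n = report
        self.counts.update(delta)  # Counter.update adds counts
        self.total_tx += n

    def frequent(self, min_support):
        """Itemsets whose reported count reaches min_support * reported total."""
        threshold = min_support * self.total_tx
        return {s: c for s, c in self.counts.items() if c >= threshold}


# Two sites, three transactions each; the third one at each site stays buffered.
sites = [RemoteSite(batch_size=2), RemoteSite(batch_size=2)]
coordinator = Coordinator()
streams = [
    [["a", "b"], ["a", "b", "c"], ["a"]],
    [["a", "c"], ["b", "c"], ["a", "b"]],
]
for site, stream in zip(sites, streams):
    for tx in stream:
        report = site.observe(tx)
        if report is not None:
            coordinator.merge(report)

result = coordinator.frequent(0.75)
```

Under this assumed rule each site withholds at most `batch_size - 1` transactions, so with k sites every count in the global synopsis is below the true count by at most k × (`batch_size` − 1). That bound is the kind of accuracy guarantee a communication strategy can be designed around; the guarantee actually proved in the paper may differ.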

Author information

Authors and Affiliations

Cloud Computing Center for Mobile Applications, Industrial Technology Research Institute, Hsinchu, Taiwan, ROC
En Tzu Wang
Department of Computer Science, National Chengchi University, Taipei, Taiwan, ROC
Arbee L. P. Chen

Corresponding author

Correspondence to Arbee L. P. Chen.

Additional information

Responsible editor: M.J. Zaki.

About this article

Cite this article

Wang, E.T., Chen, A.L.P. Mining frequent itemsets over distributed data streams by continuously maintaining a global synopsis. Data Min Knowl Disc 23, 252–299 (2011). https://doi.org/10.1007/s10618-010-0204-8

Received: 15 July 2009
Accepted: 26 October 2010
Published: 17 November 2010
Issue Date: September 2011
DOI: https://doi.org/10.1007/s10618-010-0204-8
