Abstract
Mining frequent itemsets over data streams has attracted considerable research attention in recent years. We previously developed a hash-based approach for mining frequent itemsets over a single data stream. In this paper, we extend that approach to mine global frequent itemsets from a collection of data streams distributed across remote sites. To speed up mining, we make the first attempt to address a new problem: continuously maintaining a global synopsis for the union of all the distributed streams. Mining results can then be produced on demand by processing the maintained global synopsis directly. Rather than collecting and processing all the data at a central server, which would leave the computational resources of the remote sites unused, we perform distributed computation over the data streams. We propose a distributed computation framework comprising two communication strategies and one merging operation. The communication strategies are designed around an accuracy guarantee on the mining results and determine when and what the remote sites should transmit to the central server (called the coordinator). The merging operation merges the information received from the remote sites into the global synopsis maintained at the coordinator. Together, the strategies and the merging operation keep the global synopsis continuously maintained. Based on this synopsis, we propose a mining algorithm for finding global frequent itemsets. We also provide correctness guarantees for the communication strategies and the merging operation, along with an accuracy-guarantee analysis of the mining algorithm. Finally, a series of experiments on synthetic datasets and a real dataset demonstrates the effectiveness and efficiency of the distributed computation framework.
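The division of labor described in the abstract (remote sites decide when and what to send, the coordinator merges what it receives into a global synopsis, and mining runs on that synopsis) can be illustrated with a minimal sketch. This is not the paper's algorithm: exact counters stand in for the hash-based synopsis, a fixed reporting interval stands in for the accuracy-driven communication strategies, and the names `RemoteSite`, `Coordinator`, `max_size`, and `report_every` are all hypothetical.

```python
from collections import Counter
from itertools import combinations


class RemoteSite:
    """Hypothetical remote site: counts itemsets locally and reports deltas."""

    def __init__(self, max_size=2, report_every=3):
        self.max_size = max_size          # largest itemset size tracked (assumption)
        self.report_every = report_every  # transactions between reports (assumption)
        self.pending = Counter()          # counts not yet sent to the coordinator
        self.pending_tx = 0               # transactions not yet reported

    def observe(self, transaction):
        """Count each itemset of up to max_size items; return a delta when due."""
        items = sorted(set(transaction))
        for k in range(1, self.max_size + 1):
            for itemset in combinations(items, k):
                self.pending[itemset] += 1
        self.pending_tx += 1
        # Stand-in communication rule: report after a fixed number of transactions.
        if self.pending_tx >= self.report_every:
            return self.flush()
        return None

    def flush(self):
        """Hand over the unreported counts and reset the local buffer."""
        delta = (self.pending, self.pending_tx)
        self.pending, self.pending_tx = Counter(), 0
        return delta


class Coordinator:
    """Hypothetical coordinator holding the global synopsis (exact counts here)."""

    def __init__(self):
        self.counts = Counter()
        self.total_tx = 0

    def merge(self, delta):
        """Merge one site's delta into the global synopsis."""
        counts, n_tx = delta
        self.counts.update(counts)  # Counter.update adds counts
        self.total_tx += n_tx

    def frequent(self, min_support):
        """Answer a mining request on demand from the global synopsis."""
        if self.total_tx == 0:
            return {}
        threshold = min_support * self.total_tx
        return {s: c for s, c in self.counts.items() if c >= threshold}
```

In this sketch, a driver feeds each transaction to its site with `observe` and passes any returned delta to `Coordinator.merge`. A call to `frequent(min_support)` then reflects only what the sites have reported so far, which is the kind of gap an accuracy guarantee on the communication strategies would have to bound.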
Additional information
Responsible editor: M.J. Zaki.
Cite this article
Wang, E.T., Chen, A.L.P. Mining frequent itemsets over distributed data streams by continuously maintaining a global synopsis. Data Min Knowl Disc 23, 252–299 (2011). https://doi.org/10.1007/s10618-010-0204-8
DOI: https://doi.org/10.1007/s10618-010-0204-8