Abstract.
Caching query results is an effective way to improve the performance of XML management systems. Doing so requires discovering the frequent XML queries issued by users. In this paper, we model user queries as a stream of XML query pattern trees and mine the frequent query patterns over the query stream. To facilitate the one-pass mining process, we devise a novel data structure called DTS to summarize the pattern trees seen so far. By grouping the incoming pattern trees into batches, we can dynamically mark the active portion of the current batch in DTS and limit the enumeration of candidate trees to the currently active pattern trees. We also design another summary data structure called ECTree that supports the incremental computation of the frequent tree patterns over the query stream. Based on these two constructs, we present two mining algorithms called XQSMinerI and XQSMinerII. XQSMinerI is fast, but it tends to overestimate, while XQSMinerII adopts a filter-and-refine approach to minimize the amount of overestimation. Experimental results show that the proposed methods are both efficient and scalable and require only a small memory footprint.
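To make the idea of one-pass frequency mining over a query stream concrete, the following is a minimal sketch in Python. It is not the DTS or ECTree structure and not XQSMinerI or XQSMinerII, none of which are specified in this abstract. It swaps in the generic lossy counting scheme of Manku and Motwani (2002), and it counts whole pattern trees rather than enumerating their subtrees. The tree encoding, the `canonical` helper, the `LossyCounter` class, and the parameter values are all assumptions made for illustration.

```python
def canonical(tree):
    """Encode a pattern tree (label, [children]) as a string.

    Children are sorted so that sibling order does not matter
    (an assumption of this sketch, not a claim about the paper).
    """
    label, children = tree
    if not children:
        return label
    return label + "(" + ",".join(sorted(canonical(c) for c in children)) + ")"


class LossyCounter:
    """Generic lossy counting over a stream, pruned once per batch."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = int(round(1 / epsilon))  # batch (bucket) size
        self.n = 0                            # items seen so far
        self.entries = {}                     # key -> [count, max_error]

    def add(self, key):
        self.n += 1
        bucket = (self.n + self.width - 1) // self.width
        if key in self.entries:
            self.entries[key][0] += 1
        else:
            self.entries[key] = [1, bucket - 1]
        if self.n % self.width == 0:
            # End of a batch: drop entries that cannot be frequent.
            self.entries = {
                k: v for k, v in self.entries.items() if v[0] + v[1] > bucket
            }

    def frequent(self, support):
        """Return keys whose stored count reaches (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {k: v[0] for k, v in self.entries.items() if v[0] >= threshold}


# Hypothetical query pattern trees; the labels are illustrative only.
a = ("book", [("title", []), ("author", [])])
a2 = ("book", [("author", []), ("title", [])])  # same pattern, other order
b = ("book", [("price", [])])
c = ("article", [])

counter = LossyCounter(epsilon=0.25)
for query in [a, a2, b, a, a, c, a2, a]:
    counter.add(canonical(query))

hot = counter.frequent(support=0.5)
print(hot)
```

Because the reporting threshold is lowered by `epsilon`, a counter of this kind can report patterns whose true support is slightly below the requested one, while the pruning step keeps the number of stored entries small.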
Additional information
Received: 17 October 2003, Accepted: 16 April 2004, Published online: 14 September 2004
Edited by: J. Gehrke and J. Hellerstein.
About this article
Cite this article
Yang, L.H., Lee, M.L. & Hsu, W. Finding hot query patterns over an XQuery stream. The VLDB Journal 13, 318–332 (2004). https://doi.org/10.1007/s00778-004-0134-4
DOI: https://doi.org/10.1007/s00778-004-0134-4