Abstract
In some business applications, such as trading management in financial institutions, ad hoc aggregate queries over data streams must be answered accurately. Materializing and incrementally maintaining a full data cube over a data stream, or even a compression or approximation of one, is often computationally prohibitive. On the other hand, although previous studies proposed approximate methods for continuous aggregate queries, those methods cannot provide accurate answers. In this paper, we develop a novel prefix aggregate tree (PAT) structure for online warehousing of data streams and for answering ad hoc aggregate queries. Often, a data stream can be partitioned into a historical segment, which is stored in a traditional data warehouse, and a transient segment, which can be stored in a PAT to answer ad hoc aggregate queries. The size of a PAT is linear in the size of the transient segment, and only one scan of the data stream is needed to create and incrementally maintain a PAT. Although answering a query with a PAT costs more than answering it with a fully materialized data cube, the query answering time remains linear in the size of the transient segment. Our extensive experimental results on both synthetic and real data sets illustrate the efficiency and scalability of our design.
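The following is a minimal sketch of the general idea of keeping aggregates on the nodes of a prefix tree. It is not the PAT structure defined in the paper. The class names (`Node`, `PrefixAggregateSketch`), the fixed dimension order, the COUNT and SUM measures, and the use of `None` as a wildcard are all assumptions made for illustration only. The sketch shows how each tuple can be inserted in one pass, and how an exact aggregate query can be answered by walking the tree.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Sequence, Tuple


@dataclass
class Node:
    """One prefix of dimension values, with the aggregates of all tuples sharing it."""
    count: int = 0
    total: float = 0.0
    children: Dict[str, "Node"] = field(default_factory=dict)


class PrefixAggregateSketch:
    """Hypothetical illustration: a prefix tree with COUNT and SUM on every node."""

    def __init__(self, num_dims: int) -> None:
        self.num_dims = num_dims
        self.root = Node()

    def insert(self, dims: Sequence[str], measure: float) -> None:
        """Add one stream tuple; touches at most num_dims + 1 nodes."""
        if len(dims) != self.num_dims:
            raise ValueError("tuple has the wrong number of dimensions")
        node = self.root
        node.count += 1
        node.total += measure
        for value in dims:
            node = node.children.setdefault(value, Node())
            node.count += 1
            node.total += measure

    def query(self, pattern: Sequence[Optional[str]]) -> Tuple[int, float]:
        """Return exact (COUNT, SUM) for a pattern; None matches any value."""
        if len(pattern) != self.num_dims:
            raise ValueError("pattern has the wrong number of dimensions")
        return self._query(self.root, pattern, 0)

    def _query(self, node: Node, pattern: Sequence[Optional[str]],
               depth: int) -> Tuple[int, float]:
        # If every remaining dimension is a wildcard, the aggregate stored
        # on this node already answers the query for this prefix.
        if all(p is None for p in pattern[depth:]):
            return node.count, node.total
        wanted = pattern[depth]
        if wanted is None:
            candidates = list(node.children.values())
        else:
            child = node.children.get(wanted)
            candidates = [child] if child is not None else []
        count, total = 0, 0.0
        for child in candidates:
            c, t = self._query(child, pattern, depth + 1)
            count += c
            total += t
        return count, total


if __name__ == "__main__":
    tree = PrefixAggregateSketch(num_dims=2)
    tree.insert(("NYSE", "IBM"), 100.0)
    tree.insert(("NYSE", "MSFT"), 50.0)
    tree.insert(("NASDAQ", "MSFT"), 30.0)
    print(tree.query(("NYSE", None)))
    print(tree.query((None, "MSFT")))
```

In this sketch each inserted tuple creates at most one new node per dimension, so the tree holds at most 1 + d·n nodes for n tuples with d dimensions. A query visits each node at most once, so its cost is bounded by the size of the tree.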
Additional information
Moonjung Cho is a Ph.D. candidate in the Department of Computer Science and Engineering at the State University of New York at Buffalo. She obtained her Master's degree from the same university in 2003. She has four years of industry experience as an associate researcher. Her research interests are in the areas of data mining, data warehousing, and data cubing. She has received a full scholarship from the Institute of Information Technology Assessment in Korea.
Jian Pei received the Ph.D. degree in Computing Science from Simon Fraser University, Canada, in 2002. He is currently an Assistant Professor of Computing Science at Simon Fraser University, Canada. In 2002–2004, he was an Assistant Professor of Computer Science and Engineering at the State University of New York at Buffalo, USA. His research interests can be summarized as developing advanced data analysis techniques for emerging applications. In particular, he is currently interested in various techniques of data mining, data warehousing, online analytical processing, and database systems, as well as their applications in bioinformatics. His current research is supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the National Science Foundation (NSF). He has published over 70 papers in refereed journals, conferences, and workshops, has served on the program committees of over 60 international conferences and workshops, and has been a reviewer for some leading academic journals. He is a member of the ACM, the ACM SIGMOD, and the ACM SIGKDD.
Ke Wang received his Ph.D. from the Georgia Institute of Technology. He is currently a professor at the School of Computing Science, Simon Fraser University. Before joining Simon Fraser, he was an associate professor at the National University of Singapore. He has taught in the areas of database and data mining. Ke Wang's research interests include database technology, data mining and knowledge discovery, machine learning, and emerging applications, with recent interests focusing on the end use of data mining. This includes explicitly modeling the business goal (such as profit mining, bio-mining and web mining) and exploiting user prior knowledge (such as extracting unexpected patterns and actionable knowledge). He is interested in combining the strengths of various fields such as database, statistics, machine learning and optimization to provide actionable solutions to real-life problems. Ke Wang has published in database, information retrieval, and data mining conferences, including SIGMOD, SIGIR, PODS, VLDB, ICDE, EDBT, SIGKDD, SDM and ICDM. He is an associate editor of the IEEE TKDE journal and has served on program committees for international conferences including DASFAA, ICDE, ICDM, PAKDD, PKDD, SIGKDD and VLDB.
About this article
Cite this article
Cho, M., Pei, J. & Wang, K. Answering ad hoc aggregate queries from data streams using prefix aggregate trees. Knowl Inf Syst 12, 301–329 (2007). https://doi.org/10.1007/s10115-006-0024-8
DOI: https://doi.org/10.1007/s10115-006-0024-8