Processing Data-Stream Join Aggregates Using Skimmed Sketches

Ganguly, Sumit; Garofalakis, Minos; Rastogi, Rajeev

doi:10.1007/978-3-540-24741-8_33


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2992)

Included in the following conference series:

International Conference on Extending Database Technology


Abstract

There is growing interest in on-line algorithms for analyzing and querying data streams that examine each stream element only once and have only a limited amount of memory at their disposal. Providing (perhaps approximate) answers to aggregate queries over such streams is a crucial requirement for many application environments; examples include large IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed. In this paper, we present the skimmed-sketch algorithm for estimating the join size of two streams. (Our techniques also readily extend to other join-aggregate queries.) To the best of our knowledge, our skimmed-sketch technique is the first comprehensive join-size estimation algorithm to provide tight error guarantees while: (1) achieving the lower bound on the space required by any join-size estimation method in a streaming environment, (2) handling streams containing general update operations (inserts and deletes), (3) incurring a low, logarithmic processing time per stream element, and (4) not assuming any a priori knowledge of the frequency distribution for domain values. Our skimmed-sketch technique achieves all of the above by first skimming the dense frequencies from random hash-sketch summaries of the two streams. It then computes the subjoin size involving only dense frequencies directly, and uses the skimmed sketches only to approximate subjoin sizes for the non-dense frequencies. Results from our experimental study with real-life as well as synthetic data streams indicate that our skimmed-sketch algorithm provides significantly more accurate join-size estimates than earlier sketch-based techniques.
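The skim-then-estimate idea in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract gives no algorithmic detail, so the class `HashSketch`, the functions `skim` and `skimmed_join_size`, the `threshold` parameter, and the assumption that the value domain is small enough to enumerate are all hypothetical choices made for illustration. The toy hash family is only pairwise independent, whereas analyses of such sketches typically require stronger independence. The join size is split into dense-dense, dense-sparse, sparse-dense, and sparse-sparse parts, which follows from writing each frequency vector as the sum of its dense and non-dense components.

```python
import random
from statistics import median

PRIME = 2_147_483_647  # 2^31 - 1, modulus for the toy hash family (assumption)


class HashSketch:
    """Toy hash sketch: `depth` rows of `width` signed counters (hypothetical)."""

    def __init__(self, depth, width, seed):
        rng = random.Random(seed)  # equal seeds give equal hash functions
        self.depth, self.width = depth, width
        self.params = [tuple(rng.randrange(1, PRIME) for _ in range(4))
                       for _ in range(depth)]
        self.tables = [[0] * width for _ in range(depth)]

    def _slot(self, row, value):
        a, b, c, d = self.params[row]
        bucket = ((a * value + b) % PRIME) % self.width
        sign = 1 if ((c * value + d) % PRIME) % 2 == 0 else -1
        return bucket, sign

    def update(self, value, delta=1):
        """Apply an insert (delta > 0) or a delete (delta < 0)."""
        for row in range(self.depth):
            bucket, sign = self._slot(row, value)
            self.tables[row][bucket] += sign * delta

    def estimate(self, value):
        """Median over rows of the signed counter that `value` maps to."""
        ests = []
        for row in range(self.depth):
            bucket, sign = self._slot(row, value)
            ests.append(sign * self.tables[row][bucket])
        return median(ests)

    def skim(self, domain, threshold):
        """Return estimated dense frequencies and subtract them in place."""
        dense = {}
        for v in domain:
            est = int(self.estimate(v))
            if est >= threshold:
                dense[v] = est
        for v, est in dense.items():
            self.update(v, -est)  # the sketch is linear, so this removes them
        return dense

    def inner(self, other):
        """Median over rows of the bucket-wise product of two sketches."""
        return median(sum(x * y for x, y in zip(r1, r2))
                      for r1, r2 in zip(self.tables, other.tables))


def skimmed_join_size(sk_f, sk_g, domain, threshold):
    """Estimate the join size; both sketches must share a seed. Mutates them."""
    dense_f = sk_f.skim(domain, threshold)
    dense_g = sk_g.skim(domain, threshold)
    dense_dense = sum(c * dense_g.get(v, 0) for v, c in dense_f.items())
    dense_sparse = sum(c * sk_g.estimate(v) for v, c in dense_f.items())
    sparse_dense = sum(c * sk_f.estimate(v) for v, c in dense_g.items())
    sparse_sparse = sk_f.inner(sk_g)
    return dense_dense + dense_sparse + sparse_dense + sparse_sparse


if __name__ == "__main__":
    rng = random.Random(1)
    domain = range(200)
    # One heavy value (7) plus uniformly random light values in each stream.
    f_stream = [7] * 500 + [rng.randrange(200) for _ in range(500)]
    g_stream = [7] * 300 + [rng.randrange(200) for _ in range(500)]
    sk_f, sk_g = HashSketch(5, 64, seed=42), HashSketch(5, 64, seed=42)
    for v in f_stream:
        sk_f.update(v)
    for v in g_stream:
        sk_g.update(v)
    exact = sum(f_stream.count(v) * g_stream.count(v) for v in domain)
    print("exact:", exact)
    print("estimate:", skimmed_join_size(sk_f, sk_g, domain, threshold=100))
```

Because the sketch is a linear function of the frequency vector, deletes are handled by negative updates, and subtracting an extracted dense frequency leaves a sketch of the remaining, lighter frequencies. Enumerating the domain in `skim` is the main simplification here; it would not scale to large domains.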



Author information

Authors and Affiliations

Bell Laboratories, Lucent Technologies, Murray Hill, NJ, USA
Sumit Ganguly, Minos Garofalakis & Rajeev Rastogi


Editor information

Editors and Affiliations

Purdue University
Elisa Bertino
Laboratory of Distributed Multimedia Information Systems and Applications, Technical University of Crete (MUSIC/TUC) Chania, 73100, Crete, Greece
Stavros Christodoulakis
Institute of Computer Science, FO.R.T.H., Vassilika Vouton, P.O. Box 1385, GR 71110, Heraklion, Greece
Dimitris Plexousakis
Department of Computer Science, University of Crete, P.O.Box 2208, GR 71409, Heraklion, Greece
Vassilis Christophides
National and Kapodistrian University of Athens, Greece
Manolis Koubarakis
IPD, Universität Karlsruhe, Am Fasanengarten 5, 76131, Karlsruhe
Klemens Böhm
Department of Computer Science and Communication, University of Insubria, 22100, Varese, Italy
Elena Ferrari


About this paper

Cite this paper

Ganguly, S., Garofalakis, M., Rastogi, R. (2004). Processing Data-Stream Join Aggregates Using Skimmed Sketches. In: Bertino, E., et al. Advances in Database Technology - EDBT 2004. EDBT 2004. Lecture Notes in Computer Science, vol 2992. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24741-8_33


DOI: https://doi.org/10.1007/978-3-540-24741-8_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21200-3
Online ISBN: 978-3-540-24741-8
eBook Packages: Springer Book Archive
