Progressive and Approximate Join Algorithms on Data Streams

  • Chapter
Advanced Query Processing

Part of the book series: Intelligent Systems Reference Library (ISRL, volume 36)

Abstract

In this chapter, we discuss the design and implementation of join algorithms for data streaming systems, where memory is often limited relative to the data that needs to be processed. We first focus on progressive join algorithms for various data models. We introduce a framework for progressive join processing, called the Result Rate based Progressive Join (RRPJ) framework, which can be used for join processing across these data models, and discuss its instantiations for processing relational, high-dimensional, spatial and XML data.

We then consider progressive and approximate join algorithms. The need for approximate join algorithms is motivated by the observation that users often do not require a complete set of answers. A subset of the answers, which we refer to as an approximate result, is often sufficient. Users expect the approximate result to be the largest possible, the most representative, or both, given the resources available. We discuss the tradeoffs between maximizing the quantity and the quality of the approximate result. To address these tradeoffs, we discuss a family of algorithms for progressive and approximate join processing.
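
To make the idea of progressive join processing under limited memory concrete, the following is a minimal sketch of a non-blocking symmetric hash join with a memory budget. It is not the RRPJ algorithm from the chapter: the class name `BoundedSymmetricHashJoin`, the parameters `num_partitions` and `max_tuples`, and the eviction rule (drop the partition with the fewest results per stored tuple) are all assumptions chosen for illustration, and the `evicted` list merely stands in for disk storage.

```python
from collections import defaultdict
from typing import Hashable, List, Tuple


class BoundedSymmetricHashJoin:
    """Illustrative non-blocking equi-join; emits matches as tuples arrive."""

    def __init__(self, num_partitions: int = 4, max_tuples: int = 1000) -> None:
        self.num_partitions = num_partitions
        self.max_tuples = max_tuples  # hypothetical in-memory budget
        # tables[side][partition] maps a join key to the payloads seen so far.
        self.tables = [
            [defaultdict(list) for _ in range(num_partitions)]
            for _ in range(2)
        ]
        self.sizes = [0] * num_partitions    # tuples held per partition
        self.results = [0] * num_partitions  # results produced per partition
        self.evicted: List[Tuple[int, Hashable, object]] = []  # stand-in for disk

    def insert(self, side: int, key: Hashable, payload: object) -> List[tuple]:
        """Add one tuple from stream `side` (0 or 1); return new join results."""
        p = hash(key) % self.num_partitions
        # Probe the opposite stream's table first, then store the new tuple.
        matches = self.tables[1 - side][p].get(key, [])
        out = [(payload, m) if side == 0 else (m, payload) for m in matches]
        self.results[p] += len(out)
        self.tables[side][p][key].append(payload)
        self.sizes[p] += 1
        if sum(self.sizes) > self.max_tuples:
            self._evict()
        return out

    def _evict(self) -> None:
        """Assumed policy: flush the partition with the lowest result rate."""
        candidates = [p for p in range(self.num_partitions) if self.sizes[p] > 0]
        victim = min(candidates, key=lambda p: self.results[p] / self.sizes[p])
        for side in (0, 1):
            for key, payloads in self.tables[side][victim].items():
                self.evicted.extend((side, key, x) for x in payloads)
            self.tables[side][victim].clear()
        self.sizes[victim] = 0
        self.results[victim] = 0


join = BoundedSymmetricHashJoin(num_partitions=2, max_tuples=100)
first = join.insert(0, 1, "a")   # nothing to match yet
second = join.insert(1, 1, "x")  # matches the stored "a"
third = join.insert(0, 1, "b")   # matches the stored "x"
```

Results are produced as soon as a matching pair has been seen, which is what makes the join progressive. Matches lost to eviction are not reported by this sketch; a complete algorithm would recover them by joining the flushed tuples in a later phase, or would accept the loss and report an approximate result.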

Author information

Corresponding author

Correspondence to Wee Hyong Tok.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Tok, W.H., Bressan, S. (2013). Progressive and Approximate Join Algorithms on Data Streams. In: Catania, B., Jain, L. (eds) Advanced Query Processing. Intelligent Systems Reference Library, vol 36. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28323-9_7

  • DOI: https://doi.org/10.1007/978-3-642-28323-9_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28322-2

  • Online ISBN: 978-3-642-28323-9

  • eBook Packages: Engineering, Engineering (R0)
