Skip to main content
Log in

PI-Join: Efficiently processing join queries on massive data

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

The ratio of disk capacity to disk transfer rate typically increases by 10× per decade. As a result, disk is becoming slower from the view of applications because of the much larger data volume that they need to store and process. In database systems, the less the data volume that is involved in query processing, the better the performance that is achieved. Disk-based join operation is a common but time-consuming database operation, especially in an environment of massive data in which I/O cost dominates the execution time. However, current join algorithms are only suitable for moderate or small data volume. They will incur high I/O cost when performing on massive data because of multi-pass I/O operations on the joined tables and the insensitivity to join selectivity. This paper proposes PI-Join a novel disk-based join algorithm that can efficiently process join queries involving massive data. PI-Join consists of two stages: JPIPT construction stage (JCS) and result output stage (ROS). JCS performs a cache-conscious construction algorithm on join attributes which are kept in column-oriented model to obtain join positional index pair table (JPIPT) of join results faster. The obtained JPIPT is used in ROS to retrieve results in a one-pass sequential selective scan on each table. We provide the correctness proof and cost analysis of PI-Join. Our experimental results indicate that PI-Join has a significant advantage over the existing join algorithms.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Similar content being viewed by others

References

  1. Ailamaki A, DeWitt D, Hill M, Wood D (1999) DBMSs on a modern processor: where does time go? In: Proceedings of the 25th international conference on very large data bases, pp 266–277

  2. An J, Chen H, Furuse K, Ohbo N (2005) CVA file: an index structure for high-dimensional datasets. Knowl Inf Syst 7(3): 337–357

    Article  Google Scholar 

  3. Bentley J (2000) Programming pearls. 2nd edn. ACM Press/Addison-Wesley, New York

    Google Scholar 

  4. Böhm C, Krebs F (2004) The k-nearest neighbour join: turbo charging the KDD process. Knowl Inf Syst 6(6): 728–749

    Article  Google Scholar 

  5. Boncz P, Manegold S, Kersten M (1999) Database architecture optimized for the new bottleneck: memory access. In: Proceedings of the 25th international conference on very large data bases, pp 54–65

  6. Chen H, Liu J, Furuse K, Yu J, Ohbo N (2011) Indexing expensive functions for efficient multi-dimensional similarity search. Knowl Inf Syst 27(2): 165–192

    Article  Google Scholar 

  7. Chen S, Ailamaki A, Gibbons P, Mowry T (2004) Improving hash join performance through prefetching. In: Proceedings of the 20th international conference on data engineering, pp 116–127

  8. Chen S, Ailamaki A, Gibbons P, Mowry T (2007) Improving hash join performance through prefetching. ACM Trans Database Syst 32(3): 17

    Article  Google Scholar 

  9. DeWitt D, Katz R, Olken F, Shapiro L, Stonebraker M, Wood D (1984) Implementation techniques for main memory database systems. In: Proceedings of the 1984 ACM SIGMOD international conference on management of data, pp 1–8

  10. DeWitt D, Naughton J, Schneider D (1991) An evaluation of non-equijoin algorithms. In: Proceedings of the 17th international conference on very large data bases, pp 443–452

  11. Garcia-Molina H, Ullman J, Widom J (2008) Database systems: the complete book, 2nd edn. Prentice Hall, Upper Saddle River

    Google Scholar 

  12. Graefe G (1992) Five performance enhancements for hybrid hash join. Technical report, CU-CS-606-92, University of Colorado at Boulder

    Google Scholar 

  13. Graefe G (1993) Query evaluation techniques for large databases. ACM Comput Surv 25(2): 73–170

    Article  Google Scholar 

  14. Gray J, Shenoy P (2000) Rules of thumb in data engineering. In: Proceedings of the 16th international conference on data engineering, pp 3–12

  15. He B, Luo Q (2006) Cache-oblivious nested-loop joins. In: Proceedings of the 2006 ACM CIKM international conference on information and knowledge management, pp 718–727

  16. He B, Luo Q (2007) Cache-oblivious query processing. In: Proceedings of 3rd biennial conference on innovative data systems research, pp 44–55

  17. He B, Luo Q (2008) Cache-oblivious databases: limitations and opportunities. ACM Trans Database Syst 33(2): 8

    Article  Google Scholar 

  18. Kamath M, Ramamritham K Bucket skip merge join: a scalable algorithm for join processing in very large databases using indexes. Technical report, CS-TR-96-20, University of Massachusetts

  19. Kim C, Sedlar E, Chhugani J et al (2009) Sort vs. Hash revisited: fast join implementation on modern multi-core CPUs. Proc VLDB Endow 2(2): 1378–1389

    Google Scholar 

  20. Kranen P, Assent I, Baldauf C, Seidl T (2010) The ClusTree: indexing micro-clusters for anytime stream mining. Knowl Inf Syst 1–24. doi:10.1007/s10115-010-0342-8

  21. Li Z, Ross K (1999) Fast joins using join indices. VLDB J 8(1): 1–24

    Article  MATH  Google Scholar 

  22. Manegold S, Boncz P, Kersten M (2000) What happens during a join? Dissecting CPU and Memory optimization effects. In: Proceedings of the 26th international conference on very large data bases, pp 339–350

  23. Mishra P, Eich M (1992) Join processing in relational databases. ACM Comput Surv 24(1): 63–113

    Article  Google Scholar 

  24. Nyberg C, Barclay T, Cvetanovic Z, Gray J, Lomet D (1995) AlphaSort: a cache-sensitive parallel external sor. VLDB J 4(4): 603–627

    Article  Google Scholar 

  25. O’Neil P, Graefe G (1995) Multi-table joins through bitmapped join indices. SIGMOD Rec 24(3): 8–11

    Article  Google Scholar 

  26. Oracle (2010) Oracle database 11g release 2 (11.2) documentation. http://www.oracle.com/pls/db112/homepage

  27. Parzen E (1962) On estimation of a probability density function and mode. Ann Math Stat 33(3): 1065–1076

    Article  MathSciNet  MATH  Google Scholar 

  28. Raykar V, Duraiswami R (2006) Fast optimal bandwidth selection for kernel density estimation. In: Proceedings of the 6th SIAM international conference on data mining, pp 524–528

  29. Selinger P, Astrahan M, Chamberlin D, Lorie R, Price T (1979) Access path selection in a relational database management system. In: Proceedings of the 1979 ACM SIGMOD international conference on management of data, pp 23–34

  30. Shahabi C, Khan L, McLeod D (2000) A probe-based technique to optimize join queries in distributed internet databases. Knowl Inf Syst 2(3): 373–385

    Article  Google Scholar 

  31. Shahabi C, Kolahdouzan M, Safar M (2004) Alternative strategies for performing spatial joins on web sources. Knowl Inf Syst 6(3): 290–314

    Article  Google Scholar 

  32. Shapiro L (1986) Join processing in database systems with large main memories. ACM Trans Database Syst 11(3): 239–264

    Article  Google Scholar 

  33. Shatdal A, Kant C, Naughton J (1994) Cache conscious algorithms for relational query processing. In: Proceedings of the 20th international conference on very large data bases, pp 510–521

  34. Stonebraker M, Abadi D, Batkin A et al (2005) C-store: a column-oriented DBMS. In: Proceedings of the 31st international conference on very large data bases, pp 553–564

  35. Tan K, Goh C, Lee M, Ooi B (1999) Efficient join processing using partial precomputation. Knowl Inf Syst 1(4): 481–514

    Google Scholar 

  36. Tom’s Hardware (2006) Hard drives: 40 MB to 750 GB—3,500 to 10,000 RPM. http://www.tomshardware.com/reviews/15-years-of-hard-drive-history,1368-2.html

  37. Valduriez P (1987) Join indices. ACM Trans Database Syst 12(2): 218–246

    Article  Google Scholar 

  38. Wong H, Liu H, Olken F, Rotem D, Wong L (1985) Bit transposed files. In: Proceedings of the 11th international conference on very large data bases, pp 448–457

  39. Wong H, Li J, Olken F, Rotem D, Wong L (1986) Bit transposition for very large scientific and statistical databases. Algorithmica 1(3): 289–309

    Article  MathSciNet  Google Scholar 

  40. Zhang M, Alhajj R (2010) Effectiveness of NAQ-tree as index structure for similarity search in high-dimensional metric space. Knowl Inf Syst 22(1): 1–26

    Article  MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Donghua Yang.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Han, X., Li, J. & Yang, D. PI-Join: Efficiently processing join queries on massive data. Knowl Inf Syst 32, 527–557 (2012). https://doi.org/10.1007/s10115-011-0429-x

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-011-0429-x

Keywords

Navigation