Abstract
The ratio of disk capacity to disk transfer rate typically increases by 10× per decade. As a result, disk is becoming slower from the view of applications because of the much larger data volume that they need to store and process. In database systems, the less the data volume that is involved in query processing, the better the performance that is achieved. Disk-based join operation is a common but time-consuming database operation, especially in an environment of massive data in which I/O cost dominates the execution time. However, current join algorithms are only suitable for moderate or small data volume. They will incur high I/O cost when performing on massive data because of multi-pass I/O operations on the joined tables and the insensitivity to join selectivity. This paper proposes PI-Join a novel disk-based join algorithm that can efficiently process join queries involving massive data. PI-Join consists of two stages: JPIPT construction stage (JCS) and result output stage (ROS). JCS performs a cache-conscious construction algorithm on join attributes which are kept in column-oriented model to obtain join positional index pair table (JPIPT) of join results faster. The obtained JPIPT is used in ROS to retrieve results in a one-pass sequential selective scan on each table. We provide the correctness proof and cost analysis of PI-Join. Our experimental results indicate that PI-Join has a significant advantage over the existing join algorithms.
Similar content being viewed by others
References
Ailamaki A, DeWitt D, Hill M, Wood D (1999) DBMSs on a modern processor: where does time go? In: Proceedings of the 25th international conference on very large data bases, pp 266–277
An J, Chen H, Furuse K, Ohbo N (2005) CVA file: an index structure for high-dimensional datasets. Knowl Inf Syst 7(3): 337–357
Bentley J (2000) Programming pearls. 2nd edn. ACM Press/Addison-Wesley, New York
Böhm C, Krebs F (2004) The k-nearest neighbour join: turbo charging the KDD process. Knowl Inf Syst 6(6): 728–749
Boncz P, Manegold S, Kersten M (1999) Database architecture optimized for the new bottleneck: memory access. In: Proceedings of the 25th international conference on very large data bases, pp 54–65
Chen H, Liu J, Furuse K, Yu J, Ohbo N (2011) Indexing expensive functions for efficient multi-dimensional similarity search. Knowl Inf Syst 27(2): 165–192
Chen S, Ailamaki A, Gibbons P, Mowry T (2004) Improving hash join performance through prefetching. In: Proceedings of the 20th international conference on data engineering, pp 116–127
Chen S, Ailamaki A, Gibbons P, Mowry T (2007) Improving hash join performance through prefetching. ACM Trans Database Syst 32(3): 17
DeWitt D, Katz R, Olken F, Shapiro L, Stonebraker M, Wood D (1984) Implementation techniques for main memory database systems. In: Proceedings of the 1984 ACM SIGMOD international conference on management of data, pp 1–8
DeWitt D, Naughton J, Schneider D (1991) An evaluation of non-equijoin algorithms. In: Proceedings of the 17th international conference on very large data bases, pp 443–452
Garcia-Molina H, Ullman J, Widom J (2008) Database systems: the complete book, 2nd edn. Prentice Hall, Upper Saddle River
Graefe G (1992) Five performance enhancements for hybrid hash join. Technical report, CU-CS-606-92, University of Colorado at Boulder
Graefe G (1993) Query evaluation techniques for large databases. ACM Comput Surv 25(2): 73–170
Gray J, Shenoy P (2000) Rules of thumb in data engineering. In: Proceedings of the 16th international conference on data engineering, pp 3–12
He B, Luo Q (2006) Cache-oblivious nested-loop joins. In: Proceedings of the 2006 ACM CIKM international conference on information and knowledge management, pp 718–727
He B, Luo Q (2007) Cache-oblivious query processing. In: Proceedings of 3rd biennial conference on innovative data systems research, pp 44–55
He B, Luo Q (2008) Cache-oblivious databases: limitations and opportunities. ACM Trans Database Syst 33(2): 8
Kamath M, Ramamritham K Bucket skip merge join: a scalable algorithm for join processing in very large databases using indexes. Technical report, CS-TR-96-20, University of Massachusetts
Kim C, Sedlar E, Chhugani J et al (2009) Sort vs. Hash revisited: fast join implementation on modern multi-core CPUs. Proc VLDB Endow 2(2): 1378–1389
Kranen P, Assent I, Baldauf C, Seidl T (2010) The ClusTree: indexing micro-clusters for anytime stream mining. Knowl Inf Syst 1–24. doi:10.1007/s10115-010-0342-8
Li Z, Ross K (1999) Fast joins using join indices. VLDB J 8(1): 1–24
Manegold S, Boncz P, Kersten M (2000) What happens during a join? Dissecting CPU and Memory optimization effects. In: Proceedings of the 26th international conference on very large data bases, pp 339–350
Mishra P, Eich M (1992) Join processing in relational databases. ACM Comput Surv 24(1): 63–113
Nyberg C, Barclay T, Cvetanovic Z, Gray J, Lomet D (1995) AlphaSort: a cache-sensitive parallel external sor. VLDB J 4(4): 603–627
O’Neil P, Graefe G (1995) Multi-table joins through bitmapped join indices. SIGMOD Rec 24(3): 8–11
Oracle (2010) Oracle database 11g release 2 (11.2) documentation. http://www.oracle.com/pls/db112/homepage
Parzen E (1962) On estimation of a probability density function and mode. Ann Math Stat 33(3): 1065–1076
Raykar V, Duraiswami R (2006) Fast optimal bandwidth selection for kernel density estimation. In: Proceedings of the 6th SIAM international conference on data mining, pp 524–528
Selinger P, Astrahan M, Chamberlin D, Lorie R, Price T (1979) Access path selection in a relational database management system. In: Proceedings of the 1979 ACM SIGMOD international conference on management of data, pp 23–34
Shahabi C, Khan L, McLeod D (2000) A probe-based technique to optimize join queries in distributed internet databases. Knowl Inf Syst 2(3): 373–385
Shahabi C, Kolahdouzan M, Safar M (2004) Alternative strategies for performing spatial joins on web sources. Knowl Inf Syst 6(3): 290–314
Shapiro L (1986) Join processing in database systems with large main memories. ACM Trans Database Syst 11(3): 239–264
Shatdal A, Kant C, Naughton J (1994) Cache conscious algorithms for relational query processing. In: Proceedings of the 20th international conference on very large data bases, pp 510–521
Stonebraker M, Abadi D, Batkin A et al (2005) C-store: a column-oriented DBMS. In: Proceedings of the 31st international conference on very large data bases, pp 553–564
Tan K, Goh C, Lee M, Ooi B (1999) Efficient join processing using partial precomputation. Knowl Inf Syst 1(4): 481–514
Tom’s Hardware (2006) Hard drives: 40 MB to 750 GB—3,500 to 10,000 RPM. http://www.tomshardware.com/reviews/15-years-of-hard-drive-history,1368-2.html
Valduriez P (1987) Join indices. ACM Trans Database Syst 12(2): 218–246
Wong H, Liu H, Olken F, Rotem D, Wong L (1985) Bit transposed files. In: Proceedings of the 11th international conference on very large data bases, pp 448–457
Wong H, Li J, Olken F, Rotem D, Wong L (1986) Bit transposition for very large scientific and statistical databases. Algorithmica 1(3): 289–309
Zhang M, Alhajj R (2010) Effectiveness of NAQ-tree as index structure for similarity search in high-dimensional metric space. Knowl Inf Syst 22(1): 1–26
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Han, X., Li, J. & Yang, D. PI-Join: Efficiently processing join queries on massive data. Knowl Inf Syst 32, 527–557 (2012). https://doi.org/10.1007/s10115-011-0429-x
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-011-0429-x