PI-Join: Efficiently processing join queries on massive data

Han, Xixian; Li, Jianzhong; Yang, Donghua

doi:10.1007/s10115-011-0429-x

PI-Join: Efficiently processing join queries on massive data

Regular Paper
Published: 01 July 2011

Volume 32, pages 527–557, (2012)
Cite this article

Knowledge and Information Systems Aims and scope Submit manuscript

Xixian Han¹,
Jianzhong Li¹ &
Donghua Yang²

336 Accesses
7 Citations
Explore all metrics

Abstract

The ratio of disk capacity to disk transfer rate typically increases by 10× per decade. As a result, disk is becoming slower from the view of applications because of the much larger data volume that they need to store and process. In database systems, the less the data volume that is involved in query processing, the better the performance that is achieved. Disk-based join operation is a common but time-consuming database operation, especially in an environment of massive data in which I/O cost dominates the execution time. However, current join algorithms are only suitable for moderate or small data volume. They will incur high I/O cost when performing on massive data because of multi-pass I/O operations on the joined tables and the insensitivity to join selectivity. This paper proposes PI-Join a novel disk-based join algorithm that can efficiently process join queries involving massive data. PI-Join consists of two stages: JPIPT construction stage (JCS) and result output stage (ROS). JCS performs a cache-conscious construction algorithm on join attributes which are kept in column-oriented model to obtain join positional index pair table (JPIPT) of join results faster. The obtained JPIPT is used in ROS to retrieve results in a one-pass sequential selective scan on each table. We provide the correctness proof and cost analysis of PI-Join. Our experimental results indicate that PI-Join has a significant advantage over the existing join algorithms.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

References

Ailamaki A, DeWitt D, Hill M, Wood D (1999) DBMSs on a modern processor: where does time go? In: Proceedings of the 25th international conference on very large data bases, pp 266–277
An J, Chen H, Furuse K, Ohbo N (2005) CVA file: an index structure for high-dimensional datasets. Knowl Inf Syst 7(3): 337–357
Article Google Scholar
Bentley J (2000) Programming pearls. 2nd edn. ACM Press/Addison-Wesley, New York
Google Scholar
Böhm C, Krebs F (2004) The k-nearest neighbour join: turbo charging the KDD process. Knowl Inf Syst 6(6): 728–749
Article Google Scholar
Boncz P, Manegold S, Kersten M (1999) Database architecture optimized for the new bottleneck: memory access. In: Proceedings of the 25th international conference on very large data bases, pp 54–65
Chen H, Liu J, Furuse K, Yu J, Ohbo N (2011) Indexing expensive functions for efficient multi-dimensional similarity search. Knowl Inf Syst 27(2): 165–192
Article Google Scholar
Chen S, Ailamaki A, Gibbons P, Mowry T (2004) Improving hash join performance through prefetching. In: Proceedings of the 20th international conference on data engineering, pp 116–127
Chen S, Ailamaki A, Gibbons P, Mowry T (2007) Improving hash join performance through prefetching. ACM Trans Database Syst 32(3): 17
Article Google Scholar
DeWitt D, Katz R, Olken F, Shapiro L, Stonebraker M, Wood D (1984) Implementation techniques for main memory database systems. In: Proceedings of the 1984 ACM SIGMOD international conference on management of data, pp 1–8
DeWitt D, Naughton J, Schneider D (1991) An evaluation of non-equijoin algorithms. In: Proceedings of the 17th international conference on very large data bases, pp 443–452
Garcia-Molina H, Ullman J, Widom J (2008) Database systems: the complete book, 2nd edn. Prentice Hall, Upper Saddle River
Google Scholar
Graefe G (1992) Five performance enhancements for hybrid hash join. Technical report, CU-CS-606-92, University of Colorado at Boulder
Google Scholar
Graefe G (1993) Query evaluation techniques for large databases. ACM Comput Surv 25(2): 73–170
Article Google Scholar
Gray J, Shenoy P (2000) Rules of thumb in data engineering. In: Proceedings of the 16th international conference on data engineering, pp 3–12
He B, Luo Q (2006) Cache-oblivious nested-loop joins. In: Proceedings of the 2006 ACM CIKM international conference on information and knowledge management, pp 718–727
He B, Luo Q (2007) Cache-oblivious query processing. In: Proceedings of 3rd biennial conference on innovative data systems research, pp 44–55
He B, Luo Q (2008) Cache-oblivious databases: limitations and opportunities. ACM Trans Database Syst 33(2): 8
Article Google Scholar
Kamath M, Ramamritham K Bucket skip merge join: a scalable algorithm for join processing in very large databases using indexes. Technical report, CS-TR-96-20, University of Massachusetts
Kim C, Sedlar E, Chhugani J et al (2009) Sort vs. Hash revisited: fast join implementation on modern multi-core CPUs. Proc VLDB Endow 2(2): 1378–1389
Google Scholar
Kranen P, Assent I, Baldauf C, Seidl T (2010) The ClusTree: indexing micro-clusters for anytime stream mining. Knowl Inf Syst 1–24. doi:10.1007/s10115-010-0342-8
Li Z, Ross K (1999) Fast joins using join indices. VLDB J 8(1): 1–24
Article MATH Google Scholar
Manegold S, Boncz P, Kersten M (2000) What happens during a join? Dissecting CPU and Memory optimization effects. In: Proceedings of the 26th international conference on very large data bases, pp 339–350
Mishra P, Eich M (1992) Join processing in relational databases. ACM Comput Surv 24(1): 63–113
Article Google Scholar
Nyberg C, Barclay T, Cvetanovic Z, Gray J, Lomet D (1995) AlphaSort: a cache-sensitive parallel external sor. VLDB J 4(4): 603–627
Article Google Scholar
O’Neil P, Graefe G (1995) Multi-table joins through bitmapped join indices. SIGMOD Rec 24(3): 8–11
Article Google Scholar
Oracle (2010) Oracle database 11g release 2 (11.2) documentation. http://www.oracle.com/pls/db112/homepage
Parzen E (1962) On estimation of a probability density function and mode. Ann Math Stat 33(3): 1065–1076
Article MathSciNet MATH Google Scholar
Raykar V, Duraiswami R (2006) Fast optimal bandwidth selection for kernel density estimation. In: Proceedings of the 6th SIAM international conference on data mining, pp 524–528
Selinger P, Astrahan M, Chamberlin D, Lorie R, Price T (1979) Access path selection in a relational database management system. In: Proceedings of the 1979 ACM SIGMOD international conference on management of data, pp 23–34
Shahabi C, Khan L, McLeod D (2000) A probe-based technique to optimize join queries in distributed internet databases. Knowl Inf Syst 2(3): 373–385
Article Google Scholar
Shahabi C, Kolahdouzan M, Safar M (2004) Alternative strategies for performing spatial joins on web sources. Knowl Inf Syst 6(3): 290–314
Article Google Scholar
Shapiro L (1986) Join processing in database systems with large main memories. ACM Trans Database Syst 11(3): 239–264
Article Google Scholar
Shatdal A, Kant C, Naughton J (1994) Cache conscious algorithms for relational query processing. In: Proceedings of the 20th international conference on very large data bases, pp 510–521
Stonebraker M, Abadi D, Batkin A et al (2005) C-store: a column-oriented DBMS. In: Proceedings of the 31st international conference on very large data bases, pp 553–564
Tan K, Goh C, Lee M, Ooi B (1999) Efficient join processing using partial precomputation. Knowl Inf Syst 1(4): 481–514
Google Scholar
Tom’s Hardware (2006) Hard drives: 40 MB to 750 GB—3,500 to 10,000 RPM. http://www.tomshardware.com/reviews/15-years-of-hard-drive-history,1368-2.html
Valduriez P (1987) Join indices. ACM Trans Database Syst 12(2): 218–246
Article Google Scholar
Wong H, Liu H, Olken F, Rotem D, Wong L (1985) Bit transposed files. In: Proceedings of the 11th international conference on very large data bases, pp 448–457
Wong H, Li J, Olken F, Rotem D, Wong L (1986) Bit transposition for very large scientific and statistical databases. Algorithmica 1(3): 289–309
Article MathSciNet Google Scholar
Zhang M, Alhajj R (2010) Effectiveness of NAQ-tree as index structure for similarity search in high-dimensional metric space. Knowl Inf Syst 22(1): 1–26
Article MATH Google Scholar

Download references

Author information

Authors and Affiliations

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
Xixian Han & Jianzhong Li
The Academy of Fundamental and Interdisciplinary Sciences, Harbin Institute of Technology, Harbin, China
Donghua Yang

Authors

Xixian Han
View author publications
You can also search for this author in PubMed Google Scholar
Jianzhong Li
View author publications
You can also search for this author in PubMed Google Scholar
Donghua Yang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Donghua Yang.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Han, X., Li, J. & Yang, D. PI-Join: Efficiently processing join queries on massive data. Knowl Inf Syst 32, 527–557 (2012). https://doi.org/10.1007/s10115-011-0429-x

Download citation

Received: 01 August 2010
Revised: 07 March 2011
Accepted: 23 May 2011
Published: 01 July 2011
Issue Date: September 2012
DOI: https://doi.org/10.1007/s10115-011-0429-x

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

PI-Join: Efficiently processing join queries on massive data

Abstract

Access this article

Similar content being viewed by others

MongoDB Vs PostgreSQL: A comparative study on performance aspects

A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration

Data dependencies for query optimization: a survey

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

PI-Join: Efficiently processing join queries on massive data

Abstract

Access this article

Similar content being viewed by others

MongoDB Vs PostgreSQL: A comparative study on performance aspects

A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration

Data dependencies for query optimization: a survey

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation