Parallel strategy for multiple scan operations with data replication

Wei, Xing; Hu, Huiqi; Duan, Huichao; Qian, Weining; Zhou, Aoying

doi:10.1007/s11280-018-0625-7

Published: 13 August 2018

Volume 22, pages 2561–2587, (2019)

World Wide Web

Xing Wei¹, Huiqi Hu¹ (ORCID: orcid.org/0000-0001-5220-3166), Huichao Duan¹, Weining Qian¹ & Aoying Zhou¹

Abstract

To support large-scale analytics for Web applications, the backend distributed data management system must provide access to massive data, which makes the scan operation a critical step. To improve scan performance, modern data management systems usually rely on simple partitioned parallelism: tables consist of several partitions, and each scan operation can access multiple partitions separately. This is a simple and effective solution for a single scan operation. In this paper, we consider managing multiple scan operations together, where the situation is no longer straightforward. To address the problem, we propose a parallel strategy that schedules batched scan operations together, going beyond simple partitioned parallelism. First, we use replicas to increase parallelism and propose an effective load-balancing strategy over replica nodes based on linear programming. Second, we propose an effective chunk-based scheduling algorithm for multi-threaded parallelism on each node, which guarantees that all threads have even workloads under a qualified cost model. Finally, we integrate our parallel scan strategy into an open-source distributed data management system. Experimental evaluation shows that our parallel scan strategy significantly improves scan performance.
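
The abstract names two components, balancing scans across replica nodes and evening out per-thread work on each node, but this page does not give either algorithm. The sketch below is therefore only an illustration of the two ideas, not the authors' method: it substitutes simple greedy heuristics for the linear program and for the chunk-based scheduler, and it assumes scan costs are already known. The function names, node names, and cost values are all hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def balance_over_replicas(
    partition_cost: Dict[str, float],
    replicas: Dict[str, List[str]],
) -> Dict[str, str]:
    """Pick, for each partition scan, one node that holds a replica of it.

    Greedy stand-in for a linear program (an assumption of this sketch):
    visit partitions in decreasing cost order and send each one to the
    replica node that currently carries the least load.
    """
    load: Dict[str, float] = {n: 0.0 for nodes in replicas.values() for n in nodes}
    assignment: Dict[str, str] = {}
    for part in sorted(partition_cost, key=lambda p: (-partition_cost[p], p)):
        # Ties on load are broken by node name so the result is deterministic.
        node = min(replicas[part], key=lambda n: (load[n], n))
        load[node] += partition_cost[part]
        assignment[part] = node
    return assignment


def schedule_chunks(chunk_cost: List[float], n_threads: int) -> List[List[int]]:
    """Distribute a node's chunks over threads so total costs are near-even.

    Longest-processing-time-first heuristic (again an assumption): the most
    expensive remaining chunk goes to the thread with the smallest total so far.
    """
    heap: List[Tuple[float, int]] = [(0.0, t) for t in range(n_threads)]
    heapq.heapify(heap)
    plan: List[List[int]] = [[] for _ in range(n_threads)]
    for idx in sorted(range(len(chunk_cost)), key=lambda i: (-chunk_cost[i], i)):
        cost, thread = heapq.heappop(heap)
        plan[thread].append(idx)
        heapq.heappush(heap, (cost + chunk_cost[idx], thread))
    return plan


if __name__ == "__main__":
    costs = {"p1": 4, "p2": 3, "p3": 2, "p4": 1}
    holders = {"p1": ["A", "B"], "p2": ["A", "B"], "p3": ["B", "C"], "p4": ["A", "C"]}
    print(balance_over_replicas(costs, holders))
    print(schedule_chunks([5, 3, 3, 2, 2, 1], n_threads=2))
```

Without replicas, each partition could only be scanned on its single home node. With a second copy available, the assignment step has a choice per partition, which is what makes balancing across nodes possible at all. A greedy pass like this gives no optimality guarantee; a linear program, as the abstract describes, can balance the load more precisely.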

Notes

Similar results are also tested in [15]
http://www.tpc.org/tpch/default.asp

References

Apache. HBase. http://hbase.apache.org/
Bal, H.E., Kaashoek, M.F., Tanenbaum, A.S., Jansen, J.: Replication techniques for speeding up parallel applications on distributed systems. Concurr. Pract. Exper. 4, 337–355 (1992)
Bouganim, L., Florescu, D., Valduriez, P.: Dynamic load balancing in hierarchical parallel database systems. In: Proc. of the Int. Conf. on Very Large Data Bases (VLDB). Mumbai (1996)
Bouganim, L., Florescu, D., Valduriez, P.: Load balancing for parallel query execution on NUMA multiprocessors. Distrib. Parallel Datab. 7(1), 99–121 (1999)
Chang, F., Dean, J., Ghemawat, S., Hsieh, W., Wallach, D., Burrows, M., Chandra, T., Fikes, A., Gruber, R.: Bigtable: A distributed storage system for structured data. In: Proceedings of 7th Symposium on Operating System Design and Implementation (OSDI), pp. 205–218 (2006)
Chen, M.-S., Yu, P.S., Wu, K.-L.: Scheduling and processor allocation for parallel execution of multi-join queries. In: Proceedings of the Eighth International Conference on Data Engineering, pp 58–67. IEEE Computer Society, Washington, DC (1992)
Cockshott, W.P.: Addressing mechanisms and persistent programming chapter 15 in Atkinson others (1988)
DeWitt, D., Gray, J.: Parallel database systems: The future of high performance database processing. Commun. ACM 36, 6 (1992)
Du, J., Leung, J.Y.T.: Complexity of scheduling parallel task systems. SIAM J. Discret Math. SIAM (1989)
Ferhatosmanoglu, H., Tosun, A.S., Canahuate, G., Ramachandran, A.: Efficient parallel processing of range queries through replicated declustering. Distrib. Parallel Datab. 20(2), 117–147 (2006)
Frikken, K., Atallah, M., Prabhakar, S., Safavi-Naini, R.: Optimal parallel i/o for range queries through replication. In: Proceedings of 13th International Conference of Database and Expert Systems Applications (DEXA), pp. 669–678 (2002)
Graefe, G.: Volcano-an extensible and parallel query evaluation system. IEEE Trans. Knowl. Data Eng., 6(1) (1994)
IBM: DB2. intra-partition parallelism https://www.ibm.com/support/knowledgecenter/en/SSEPGG_9.7.0/com.ibm.db2.luw.admin.perf.doc/doc/c0005323.html (2009)
Johnson, R., Hardavellas, N., Pandis, I., Mancheril, N., Harizopoulos, S., Sabirli, K., Ailamaki, A., Falsafi, B.: To share or not to share? In: VLDB (2007)
Krikellas, K., Cintra, M., Viglas, S.: Scheduling threads for intra-query parallelism on multicore processors. In: EDBT (2010)
Krompass, S., Kuno, H., Dayal, U., Kemper, A.: Dynamic workload management for very large data warehouses: Juggling feathers and bowling balls. In: Proc. of the 33rd Intl. Conf. on Very Large Databases (VLDB), pp. 1105–1115 (2007)
Kuo, T.-W., Wei, C.-H., Lam, K.-y.: Real-time data access control on B-tree index structures. In: IEEE 15th International Conference on Data Engineering. Sydney (1999)
Lee, R., Ding, X., Chen, F., Lu, Q., Zhang, X.: MCC-DB: Minimizing cache conflicts in multi-core processors for databases. PVLDB 2(1), 373–384 (2009)
Lim, L., Wang, M., Vitter, J.S.: SASH: A self-adaptive histogram set for dynamically changing workloads. In: Proceedings of 29th VLDB Conference. Berlin (2003)
Microsoft: SQL Server parallelism enhancements http://sqlmag.com/sql-server-2008/parallelism-enhancements-sql-server-2008 (2008)
OceanBase. https://github.com/alibaba/oceanbase/
Open Source DB. https://www.postgresql.org/
Oracle Database 11g. Parallel execution https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel002.htm. (2007)
Pan, C.S., Zymbler, M.L.: Encapsulation of partitioned parallelism into open-source database management systems. Program Comput. Softw. 41(6), 350–360 (2015)
Percival, C.: Cache missing for fun and profit. In: Proc. of BSDCan 2005 (2005)
Pivotal. GREENPLUM DB. http://greenplum.org/
Qiao, L., Raman, V., Reiss, F., Haas, P.J., Lohman, G.M.: Main-memory scan sharing for multi-core CPUs. Proc. VLDB Endow. 1(1), 610–621 (2008)
Rahm, E., Stöhr, T.: Analysis of parallel scan processing in parallel shared disk database systems. In: Proc. EURO-PAR Conf., LNCS, p. 966. Springer (1995)
Ristau, B., Fettweis, G.: An optimization methodology for memory allocation and task scheduling in SoCs via linear programming. In: SAMOS, pp. 89–98 (2006)
Sokolinsky, L.B.: Survey of architectures of parallel database system. Program Comput. Softw. 30(6), 337–346 (2004)
Son, S.H.: Replicated data management in distributed database systems, ACM SIGMOD, vol. 17 Issue 4, pp 62–69. ACM, New York (1988)
Tsafrir, D.: The context-switch overhead inflicted by hardware interrupts (and the enigma of do-nothing loops). In: Proceeding ecs’07 Experimental computer science on Experimental computer science, pp. 3–3. San Diego (2007)
Valduriez, P.: Parallel Database Systems: Open Problems and New Issues, Distributed and Parallel Databases. Springer (1993)

Acknowledgments

This work is partially supported by the National Science Foundation of China under grant numbers 61702189, 61432006 and 61672232, and by the Youth Science and Technology “Yang Fan” Program of Shanghai under grant number 17YF1427800. Huiqi Hu is the corresponding author.

Author information

Authors and Affiliations

School of Data Science and Engineering, East China Normal University, Shanghai, China
Xing Wei, Huiqi Hu, Huichao Duan, Weining Qian & Aoying Zhou

Corresponding author

Correspondence to Huiqi Hu.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on Web and Big Data

Guest Editors: Junjie Yao, Bin Cui, Christian S. Jensen, and Zhe Zhao

About this article

Cite this article

Wei, X., Hu, H., Duan, H. et al. Parallel strategy for multiple scan operations with data replication. World Wide Web 22, 2561–2587 (2019). https://doi.org/10.1007/s11280-018-0625-7

Received: 30 November 2017
Revised: 23 May 2018
Accepted: 19 July 2018
Published: 13 August 2018
Issue Date: November 2019
DOI: https://doi.org/10.1007/s11280-018-0625-7
