Abstract
To support the large-scale analytic for Web applications, the backend distributed data management system must provide the service for accessing massive data. Thus, the scan operation becomes a critical step. To improve the performance of scan operation, modern data management systems usually rely on the simple partitioned parallelism. Under the partitioned parallelism, tables are consist of several partitions, and each scan operation can access multiple partitions separately. It is a simple and effective solution for a single scan operation. In this paper, we consider managing multiple scan operations together, where the situation is no longer straightforward. To address the problem, we propose the parallel strategy to schedule batched scan operations together beyond the simple partitioned parallelism. For the sake of performance, first, we utilize replications to increase the parallelism and propose an effective load balancing strategy over replication nodes based on linear programming. Second, we propose an effective chunk-based scheduling algorithm for multi-threading parallelism on each node to guarantee all threads have even workloads under a qualified cost model. Finally, we integrate our parallel scan strategy into an open-sourced distributed data management system. Experimental evaluation shows our parallel scan strategy significantly improves the performance of scan operation.
Similar content being viewed by others
Notes
Similar results are also tested in [15]
References
Apache. HBase. http://hbase.apache.org/
Bal, H.E., Kaashoek, M.F., Tanenbaum, A.S., Jansen, J.: Replication techniques for speeding up parallel applications on distributed systems. Concurr. Pract. Exper. 4, 337–355 (1992)
Bouganim, L., Florescu, D., Valduriez, P.: Dynamic load balancing in hierarchical parallel database systems. In: Proc. of the Int. Conf. on Very Large Data Bases (VLDB). Mumbai (1996)
Bouganim, L., Florescu, D., Valduriez, P.: Load balancing for parallel query execution on NUMA multiprocessors. Distrib. Parallel Datab. 7(1), 99–121 (1999)
Chang, F., Dean, J., Ghemawat, S., Hsieh, W., Wallach, D., Burrows, M., Chandra, T., Fikes, A., Gruber, R.: Bigtable: A distributed storage system for structured data. In: Proceedings of 7th Symposium on Operating System Design and Implementation (OSDI), pp. 205218 (2006)
Chen, M.-S., Yu, P.S., Wu, K.-L.: Scheduling and processor allocation for parallel execution of multi-join queries. In: Proceedings of the Eighth International Conference on Data Engineering, pp 58–67. IEEE Computer Society, Washington, DC (1992)
Cockshott, W.P.: Addressing mechanisms and persistent programming chapter 15 in Atkinson others (1988)
DeWitt, D., Gray, J.: Parallel database systems: The future of high performance database processing. Commun. ACM 36, 6 (1992)
Du, J., Leung, J.Y.T.: Complexity of scheduling parallel task systems. SIAM J. Discret Math. SIAM (1989)
Ferhatosmanoglu, H., Tosun, A.S., Canahuate, G., Ramachandran, A.: Efficient parallel processing of range queries through replicated declustering. Distrib. Parallel Datab. 20(2), 117–147 (2006)
Frikken, K., Atallah, M., Prabhakar, S., Safavi-Naini, R.: Optimal parallel i/o for range queries through replication. In: Proceedings of 13th International Conference of Database and Expert Systems Applications (DEXA), pp. 669–678 (2002)
Graefe, G.: Volcano-an extensible and parallel query evaluation system. IEEE Trans. Knowl. Data Eng., 6(1) (1994)
IBM: DB2. intra-partition parallelism https://www.ibm.com/support/knowledgecenter/en/SSEPGG_9.7.0/com.ibm.db2.luw.admin.perf.doc/doc/c0005323.html (2009)
Johnson, R., Hardavellas, N., Pandis, I., Mancheril, N., Harizopoulos, S., Sabirli, K., Ailamaki, A., Falsafi, B.: To share or not to share? In: VLDB (2007)
Krikellas, K., Cintra, M., Viglas, S.: Scheduling threads for intra-query parallelism on multicore processors. In: EDBT (2010)
Krompass, S., Kuno, H., Dayal, U., Kemper, A.: Dynamic workload management for very large data warehouses: Juggling feathers and bowling balls. In: Proc. of the 33rd Intl. Conf. on Very Large Databases (VLDB), pp. 1105–1115 (2007)
Kuo, T.-W., Wei, C.-H., Lam, K.-y.: Real-time data access control on B-tree index structures. In: IEEE 15th International Conference on Data Engineering. Sydney (1999)
Lee, R., Ding, X., Chen, F., Lu, Q., Zhang, X.: MCC-DB: Minimizing cache conflicts in multi-core processors for databases. PVLDB 2(1), 373–384 (2009)
Lim, L., Wang, M., Vitter, J.S.: SASH: A self-adaptive histogram set for dynamically changing workloads. In: Proceedings of 29th VLDB Conference. Berlin (2003)
Microsoft: SQL Server parallelism enhancements http://sqlmag.com/sql-server-2008/parallelism-enhancements-sql-server-2008 (2008)
OceanBase. https://github.com/alibaba/oceanbase/
Open Source DB. https://www.postgresql.org/
Oracle Database 11g. Parallel execution https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel002.htm. (2007)
Pan, C.S., Zymbler, M.L.: Encapsulation of partitioned parallelism into open-source database management systems. Program Comput. Softw. 41(6), 350–360 (2015)
Percival, C.: Cache missing for fun and profit. In: Proc. of BSDCan 2005 (2005)
Pivotal. GREENPLUM DB. http://greenplum.org/
Qiao, L., Raman, V., Reiss, F., Haas, P.J., Lohman, G.M.: Main-memory scan sharing for multi-core CPUs. Proc. VLDB Endow. 1(1), 610–621 (2008)
Rahm, E., Stöhr, T.: Analysis of parallel scan processing in parallel shared disk database systems. In: Proc. EURO-PAR Conf., LNCS, p. 966. Springer (1995)
Ristau, B., Fettweis, G.: An optimization methodology for memory allocation and task scheduling in SoCs via linear programming SAMOS 89–98 (2006)
Sokolinsky, LB.: Survey of architectures of parallel database system. Program Comput. Softw. 30(6), 337–346 (2004)
Son, SH.: Replicated data management in distributed database systems, ACM SIGMOD, vol. 17 Issue 4, pp 62–69. ACM, New York (1988)
Tsafrir, D.: The context-switch overhead inflicted by hardware interrupts (and the enigma of do-nothing loops). In: Proceeding ecs’07 Experimental computer science on Experimental computer science, pp. 3–3. San Diego (2007)
Valduriez, P.: Parallel Database Systems: Open Problems and New Issues, Distributed and Parallel Databases. Springer (1993)
Acknowledgments
This is work is partially supported by National Science Foundation of China under grant numbers 61702189, 61432006 and 61672232, and Youth Science and Technology - “Yang Fan” Program of Shanghai under grant number 17YF1427800. Huiqi Hu is the corresponding author.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article belongs to the Topical Collection: Special Issue on Web and Big Data
Guest Editors: Junjie Yao, Bin Cui, Christian S. Jensen, and Zhe Zhao
Rights and permissions
About this article
Cite this article
Wei, X., Hu, H., Duan, H. et al. Parallel strategy for multiple scan operations with data replication. World Wide Web 22, 2561–2587 (2019). https://doi.org/10.1007/s11280-018-0625-7
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11280-018-0625-7