DOI: 10.1145/3626246.3653377
research-article

PolarDB-MP: A Multi-Primary Cloud-Native Database via Disaggregated Shared Memory

Published: 09 June 2024 Publication History

Abstract

Primary-secondary databases often have limited write throughput as they rely on a single primary node. To improve this, some systems use a shared-nothing architecture for scalable multi-primary clusters. However, these face performance issues due to distributed transaction overheads. Recently, shared-storage-based multi-primary cloud-native databases have emerged to avoid these issues, but they still struggle with performance in high-conflict scenarios, often due to expensive conflict resolution and inefficient data fusion.
This paper proposes PolarDB-MP, an innovative multi-primary cloud-native database that leverages both disaggregated shared memory and storage. In PolarDB-MP, each node has equal access to all data, enabling transactions to be processed on individual nodes without the need for distributed transactions. At the core of PolarDB-MP is the Polar Multi-Primary Fusion Server (PMFS), built on disaggregated shared memory. PMFS plays a critical role in facilitating global transaction coordination and enhancing buffer fusion, seamlessly integrated with RDMA for minimal latency. Its three main functionalities include Transaction Fusion for transaction ordering and visibility, Buffer Fusion providing a distributed shared buffer, and Lock Fusion for cross-node concurrency control. Moreover, PolarDB-MP introduces an LLSN design, establishing a partial order for write-ahead logs generated across different nodes, accompanied by a tailored recovery policy. Our evaluations of PolarDB-MP demonstrate its superior performance when compared to the state-of-the-art solutions. Notably, PolarDB-MP is already in production and undergoing commercial trials at Alibaba Cloud. To our knowledge, PolarDB-MP is the first multi-primary cloud-native database that utilizes disaggregated shared memory and shared storage for transaction coordination and buffer fusion.
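
The abstract does not say how LLSNs are assigned, so the following is only an illustrative sketch of how a partial order over logs from several nodes could be built. It assumes a Lamport-clock-style rule: a node stamps each new log record with one more than the larger of its own counter and the counter stored on the page being modified. The names `Node`, `Page`, `update`, and `records_for_page` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    page_id: int
    llsn: int = 0  # stamp of the last log record that modified this page


@dataclass
class Node:
    name: str
    llsn: int = 0  # highest stamp this node has generated or observed
    log: list = field(default_factory=list)  # this node's own write-ahead log

    def update(self, page: Page, change: str) -> int:
        # Assumed rule (Lamport-style): the new record must be ordered after
        # everything this node has logged and after the page's last change.
        self.llsn = max(self.llsn, page.llsn) + 1
        page.llsn = self.llsn
        self.log.append((self.llsn, page.page_id, change))
        return self.llsn


def records_for_page(nodes, page_id):
    # Merge the per-node logs and order one page's records by stamp.
    merged = [rec for node in nodes for rec in node.log if rec[1] == page_id]
    return sorted(merged, key=lambda rec: rec[0])


# Two nodes interleave writes to the same page.
a, b = Node("a"), Node("b")
p = Page(page_id=1)
a.update(p, "x=1")
a.update(p, "x=2")
b.update(p, "x=3")
history = records_for_page([a, b], page_id=1)
```

Under this assumed rule, records that touch the same page are totally ordered across nodes, while records on unrelated pages written by different nodes may carry equal stamps and remain unordered. That is what makes the order partial rather than total.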

Supplemental Material

MP4 File
Presentation video

Cited By

  • (2024) GaussDB: A Cloud-Native Multi-Primary Database with Compute-Memory-Storage Disaggregation. Proceedings of the VLDB Endowment 17:12 (3786-3798). DOI: 10.14778/3685800.3685806. Online publication date: 8-Nov-2024
  • (2024) GreenB+Tree: an energy-efficient B+tree for MIMD architectures. CCF Transactions on High Performance Computing. DOI: 10.1007/s42514-024-00204-z. Online publication date: 18-Dec-2024

    Information & Contributors

    Information

    Published In

    SIGMOD/PODS '24: Companion of the 2024 International Conference on Management of Data
    June 2024
    694 pages
    ISBN:9798400704222
    DOI:10.1145/3626246
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cloud-native database
    2. disaggregated shared-memory
    3. multi-primary database

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '24

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 1,511
    • Downloads (Last 6 weeks): 95
    Reflects downloads up to 17 Jan 2025
