ABSTRACT
Standard state-machine replication involves consensus on a sequence of totally ordered requests through, for example, the Paxos protocol. Such a sequential execution model is becoming outdated on prevalent multi-core servers. Highly concurrent executions on multi-core architectures introduce non-determinism related to thread scheduling and lock contention, which fundamentally breaks the determinism assumption in state-machine replication. This tension between concurrency and consistency is not inherent, because the total ordering of requests is merely a simplifying convenience that is unnecessary for consistency. Concurrent execution of the application can be decoupled from the sequence of consensus decisions by reaching consensus on partial-order traces rather than on totally ordered requests. These traces capture the non-deterministic decisions made in one replica's execution, and the other replicas replay them with the same decisions. The result is a new multi-core friendly replicated state-machine framework that achieves strong consistency while preserving parallelism in multi-threaded applications. On 12-core machines with hyper-threading, evaluations on typical applications show that we can scale with the number of cores, achieving up to 16 times the throughput of standard replicated state machines.
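The record-and-replay idea in the abstract can be illustrated with a minimal Python sketch. It is not the framework's actual interface: `RecordingLock`, `ReplayLock`, and `run` are hypothetical names, and the consensus step that would agree on the trace is left out. One execution records, per lock, the order in which threads acquired it, and a second execution is forced to grant each lock in that same order. Acquisitions of different locks stay unordered with respect to each other, which is what makes the trace a partial order.

```python
import threading


class RecordingLock:
    """Hypothetical primary-side lock: records the order of acquisitions."""

    def __init__(self):
        self._lock = threading.Lock()
        self.trace = []  # thread names, in acquisition order, for this lock only

    def __enter__(self):
        self._lock.acquire()
        self.trace.append(threading.current_thread().name)
        return self

    def __exit__(self, *exc):
        self._lock.release()
        return False


class ReplayLock:
    """Hypothetical secondary-side lock: grants itself in the recorded order."""

    def __init__(self, trace):
        self._trace = list(trace)
        self._next = 0      # index of the next acquisition to allow
        self._held = False
        self._cond = threading.Condition()

    def _my_turn(self, me):
        return (not self._held
                and self._next < len(self._trace)
                and self._trace[self._next] == me)

    def __enter__(self):
        me = threading.current_thread().name
        with self._cond:
            self._cond.wait_for(lambda: self._my_turn(me))
            self._held = True
        return self

    def __exit__(self, *exc):
        with self._cond:
            self._held = False
            self._next += 1
            self._cond.notify_all()
        return False


def run(locks, workers=3, steps=4):
    """Run the same multi-threaded workload against the given locks."""
    state = {key: [] for key in locks}

    def work():
        name = threading.current_thread().name
        for i in range(steps):
            key = "a" if i % 2 == 0 else "b"  # alternate between two locks
            with locks[key]:
                state[key].append((name, i))

    threads = [threading.Thread(target=work, name=f"t{k}")
               for k in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state


# "Primary" execution: the thread scheduler decides the order, and we record it.
primary_locks = {"a": RecordingLock(), "b": RecordingLock()}
primary_state = run(primary_locks)

# "Secondary" execution: replay the recorded per-lock order.
traces = {key: lock.trace for key, lock in primary_locks.items()}
replica_locks = {key: ReplayLock(trace) for key, trace in traces.items()}
replica_state = run(replica_locks)
```

In this sketch the replica ends in the same state as the primary even though its threads are scheduled independently, because only the recorded lock-acquisition order is enforced and everything else is free to run in parallel.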
Index Terms
- Rex: replication at the speed of multi-core