CORFU: A distributed shared log

Published: 20 December 2013

Abstract

CORFU is a global log that clients can append to and read from over a network. Internally, CORFU is distributed over a cluster of machines in such a way that there is no single I/O bottleneck for either appends or reads. Data is fully replicated for fault tolerance, and a modest cluster of about 16 to 32 machines with SSDs can sustain 1 million 4-KByte operations per second.

The CORFU log enabled the construction of a variety of distributed applications that require strong consistency at high speeds, such as databases, transactional key-value stores, replicated state machines, and metadata services.
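
To make the append/read interface and the "no single I/O bottleneck" idea concrete, here is a minimal in-memory sketch of a striped, replicated log. It is not the CORFU protocol: the class names `StorageUnit` and `ToySharedLog`, the round-robin placement, the single-process tail counter, and the write-once check are all assumptions chosen for illustration, and the sketch ignores networking, failures, and reconfiguration entirely.

```python
class StorageUnit:
    """Hypothetical write-once page store standing in for one machine."""

    def __init__(self):
        self.pages = {}  # local address -> bytes

    def write(self, address, data):
        # Assumption: each address may be written at most once.
        if address in self.pages:
            raise ValueError(f"address {address} already written")
        self.pages[address] = data

    def read(self, address):
        return self.pages[address]


class ToySharedLog:
    """Toy log that stripes positions across replica groups (illustrative only)."""

    def __init__(self, num_groups, replicas_per_group):
        self.groups = [
            [StorageUnit() for _ in range(replicas_per_group)]
            for _ in range(num_groups)
        ]
        self.tail = 0  # next free log position

    def _locate(self, position):
        # Round-robin striping: consecutive positions land on different groups,
        # so no single group absorbs every append or read.
        group = self.groups[position % len(self.groups)]
        local_address = position // len(self.groups)
        return group, local_address

    def append(self, data):
        position = self.tail
        self.tail += 1
        group, address = self._locate(position)
        for unit in group:  # full replication inside the group
            unit.write(address, data)
        return position

    def read(self, position):
        if not 0 <= position < self.tail:
            raise IndexError(f"position {position} not written")
        group, address = self._locate(position)
        return group[0].read(address)  # any replica holds the same page


log = ToySharedLog(num_groups=4, replicas_per_group=2)
positions = [log.append(f"entry-{i}".encode()) for i in range(8)]
entry_five = log.read(5)
```

With four groups, the eight appends above put two pages on each group, and `log.read(5)` is served by the second group alone. As a rough sense of scale for the figure in the abstract, and taking 4 KByte as 4,096 bytes, 1 million operations per second corresponds to about 4.1 GB/s of aggregate payload across the cluster.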



          • Published in

            ACM Transactions on Computer Systems, Volume 31, Issue 4
            December 2013
            90 pages
            ISSN: 0734-2071
            EISSN: 1557-7333
            DOI: 10.1145/2542150

            Copyright © 2013 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 20 December 2013
            • Accepted: 1 March 2013
            • Received: 1 December 2012


            Qualifiers

            • research-article
            • Research
            • Refereed
