Survey of Large-Scale Data Management Systems for Big Data Applications

  • Survey
Journal of Computer Science and Technology

Abstract

Today, data is flowing into organizations at an unprecedented scale. The ability to scale out to process an increasing workload has become an important factor in the proliferation and popularization of database systems. Big data applications demand, and consequently drive the development of, diverse large-scale data management systems in different organizations, ranging from traditional database vendors to emerging Internet-based enterprises. In this survey, we investigate, characterize, and analyze large-scale data management systems in depth and develop comprehensive taxonomies for critical aspects covering the data model, the system architecture, and the consistency model. We map the prevailing highly scalable data management systems to the proposed taxonomies, not only to classify the common techniques but also to provide a basis for analyzing current limitations on system scalability. To overcome these limitations, we predict and highlight the principles that future efforts should follow for the next generation of large-scale data management systems.

Author information

Corresponding author

Correspondence to Lengdong Wu.

Additional information

The work was partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) under Grant No. RES0001375, and Shanghai Shifang Company under Grant No. RES0012485.

About this article

Cite this article

Wu, L., Yuan, L. & You, J. Survey of Large-Scale Data Management Systems for Big Data Applications. J. Comput. Sci. Technol. 30, 163–183 (2015). https://doi.org/10.1007/s11390-015-1511-8

  • DOI: https://doi.org/10.1007/s11390-015-1511-8
