Skip to main content

Advertisement

Log in

A distributed in-memory key-value store system on heterogeneous CPU–GPU cluster

  • Regular Paper
  • Published:
The VLDB Journal Aims and scope Submit manuscript

Abstract

In-memory key-value stores play a critical role in many data-intensive applications to provide high-throughput and low latency data accesses. In-memory key-value stores have several unique properties that include (1) data-intensive operations demanding high memory bandwidth for fast data accesses, (2) high data parallelism and simple computing operations demanding many slim parallel computing units, and (3) a large working set. However, our experiments show that homogeneous multicore CPU systems are increasingly mismatched to the special properties of key-value stores because they do not provide massive data parallelism and high memory bandwidth; the powerful but the limited number of computing cores does not satisfy the demand of the unique data processing task; and the cache hierarchy may not well benefit to the large working set. In this paper, we present the design and implementation of Mega-KV, a distributed in-memory key-value store system on a heterogeneous CPU–GPU cluster. Effectively utilizing the high memory bandwidth and latency hiding capability of GPUs, Mega-KV provides fast data accesses and significantly boosts overall performance and energy efficiency over the homogeneous CPU architectures. Mega-KV shows excellent scalability and processes up to 623-million key-value operations per second on a cluster installed with eight CPUs and eight GPUs, while delivering an efficiency of up to 299-thousand operations per Watt (KOPS/W).

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18
Fig. 19
Fig. 20

Similar content being viewed by others

References

  1. CPU Frequency Scaling. https://wiki.archlinux.org/index.php/CPU_frequency_scaling/

  2. Intel dpdk. http://dpdk.org/

  3. Memcached. http://memcached.org/

  4. Nvidia management library. https://developer.nvidia.com/nvidia-management-library-nvml/

  5. Redis. http://redis.io/

  6. Andersen, D.G., Franklin, J., Kaminsky, M., Phanishayee, A., Tan, L., Vasudevan, V.: Fawn: a fast array of wimpy nodes. In: SOSP, pp. 1–14 (2009)

  7. Atikoglu, B., Xu, Y., Frachtenberg, E., Jiang, S., Paleczny, M.: Workload analysis of a large-scale key-value store. In: SIGMETRICS, pp. 53–64 (2012)

  8. Berezecki, M., Frachtenberg, E., Paleczny, M., Steele, K.: Many-core key-value store. In: IGCC, pp. 1–8 (2011)

  9. Chalamalasetti, S.R., Lim, K., Wright, M., AuYoung, A., Ranganathan, P., Margala, M.: An FPGA memcached appliance. In: FPGA, pp. 245–254 (2013)

  10. Cooper, B.F., Silberstein, A., Tam, E., Ramakrishnan, R., Sears, R.: Benchmarking cloud serving systems with YCSB. In: SoCC, pp. 143–154 (2010)

  11. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithm, 3rd edn. The MIT Press, Cambridge (2009)

    MATH  Google Scholar 

  12. Erlingsson, Ú., Manasse, M., McSherry, F.: A cool and practical alternative to traditional hash tables. In: WDAS, pp. 1–6 (2006)

  13. Escriva, R., Wong, B., Sirer, E.G.: Hyperdex: a distributed, searchable key-value store. ACM SIGCOMM Comput. Commun. Rev. 42(4), 25–36 (2012)

    Article  Google Scholar 

  14. Fan, B., Andersen, D.G., Kaminsky, M.: Memc3: Compact and concurrent memcache with dumber caching and smarter hashing. In: NSDI, pp. 371–384 (2013)

  15. Ferdman, M., Adileh, A., Kocberber, O., Volos, S., Alisafaee, M., Jevdjic, D., Kaynak, C., Popescu, A., Ailamaki, A., Falsafi, B.: A case for specialized processors for scale-out workloads. In: Micro, pp. 31–42 (2014)

  16. Geambasu, R., Levy, A.A., Kohno, T., Krishnamurthy, A., Levy, H.M.: Comet: an active distributed key-value store. In: OSDI (2010)

  17. Gray, J., Sundaresan, P., Englert, S., Baclawski, K., Weinberger, P.J.: Quickly generating billion-record synthetic databases. In: SIGMOD, pp. 243–252 (1994)

  18. Gutierrez, A., Cieslak, M., Giridhar, B., Dreslinski, R.G., Ceze, L., Mudge, T.: Integrated 3d-stacked server designs for increasing physical density of key-value stores. In: ASPLOS, pp. 485–498 (2014)

  19. Han, S., Jang, K., Park, K., Moon, S.: Packetshader: A GPU-accelerated software router. In: SIGCOMM, pp. 195–206 (2010)

  20. He, B., Yang, K., Fang, R., Lu, M., Govindaraju, N., Luo, Q., Sander, P.: Relational joins on graphics processors. In: SIGMOD, pp. 511–524 (2008)

  21. He, B., Yu, J.X.: High-throughput transaction executions on graphics processors. In: PVLDB (2011)

  22. Heimel, M., Saecker, M., Pirk, H., Manegold, S., Markl, V.: Hardware-oblivious parallelism for in-memory column-stores. In: PVLDB, pp. 709–720 (2013)

  23. Hetherington, T., Rogers, T., Hsu, L., O’Connor, M., Aamodt, T.: Characterizing and evaluating a key-value store application on heterogeneous CPU-GPU systems. In: ISPASS, pp. 88–98 (2012)

  24. Hetherington, T.H., O’Connor, M., Aamodt, T.M.: Memcachedgpu: Scaling-up scale-out key-value stores. In: SoCC, pp. 43–57 (2015)

  25. Hong, S., Kim, H.: An integrated GPU power and performance model. ACM SIGARCH Comput. Archit. News 38(3), 280–289 (2010)

    Article  Google Scholar 

  26. Jeong, E.Y., Woo, S., Jamshed, M., Jeong, H., Ihm, S., Han, D., Park, K.: mTCP: A highly scalable user-level tcp stack for multicore systems. In: NSDI (2014)

  27. Kaldewey, T., Lohman, G., Mueller, R., Volk, P.: GPU join processing revisited. In: DaMoN, pp. 55–62 (2012)

  28. Kapoor, R., Porter, G., Tewari, M., Voelker, G.M., Vahdat, A.: Chronos: predictable low latency for data center applications. In: SoCC, pp. 9:1–9:14 (2012)

  29. Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., Lewin, D.: Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: STOC, pp. 654–663 (1997)

  30. Lavasani, M., Angepat, H., Chiou, D.: An fpga-based in-line accelerator for memcached. Comput. Archit. Lett. 13(2), 57–60 (2013)

    Article  Google Scholar 

  31. Lee, J., Sathisha, V., Schulte, M., Compton, K., Kim, N. S.: Improving throughput of power-constrained GPUs using dynamic voltage/frequency and core scaling. In: 2011 International Conference on Parallel Architectures and Compilation Techniques. Galveston, TX, pp. 111–120 (2011)

  32. Leng, J., Hetherington, T., Eltantawy, A., Gilani, S., Kim, N.S., Aamodt, T.M., Reddi, V.J.: GPUWattch: enabling energy optimizations in GPGPUs. ACM SIGARCH Comput. Archit. News 41(3), 487–498 (2013)

    Article  Google Scholar 

  33. Leng, T., Ali, R., Hsieh, J., Mashayekhi, V., Rooholamini, R.: An empirical study of hyper-threading in high performance computing clusters. Linux HPC Revolution (2002)

  34. Li, C., Cox, A.L.: Gd-wheel: A cost-aware replacement policy for key-value stores. In: Proceedings of the Tenth European Conference on Computer Systems, EuroSys (2015)

  35. Li, S., Lim, H., Lee, V.W., Ahn, J.H., Kalia, A., Kaminsky, M., Andersen, D.G., Seongil, O., Lee, S., Dubey, P.: Architecting to achieve a billion requests per second throughput on a single key-value store server platform. In: ISCA, pp. 476–488 (2015)

  36. Lim, H., Han, D., Andersen, D.G., Kaminsky, M.: Mica: A holistic approach to fast in-memory key-value storage. In: NSDI, pp. 429–444 (2014)

  37. Lim, K., Meisner, D., Saidi, A.G., Ranganathan, P., Wenisch, T.F.: Thin servers with smart pipes: designing soc accelerators for memcached. SIGARCH Comput. Archit. News 41, 36–47 (2013)

    Article  Google Scholar 

  38. Ma, K., Li, X., Chen W., Zhang, C., Wang, X.: Green GPU: a holistic approach to energy efficiency in GPU-CPU heterogeneous architectures. In: 2012 41st International Conference on Parallel Processing. Pittsburgh, PA, pp. 48–57 (2012)

  39. Mao, Y., Kohler, E., Morris, R.T.: Cache craftiness for fast multicore key-value storage. In: EuroSys, pp. 183–196 (2012)

  40. Metreveli, Z., Zeldovich, N., Kaashoek, M.F.: Cphash: A cache-partitioned hash table. In: PPoPP, pp. 319–320 (2012)

  41. Mitchell, C., Geng, Y., Li, J.: Using one-sided rdma reads to build a fast, CPU-efficient key-value store. In: USENIX ATC, pp. 103–114 (2013)

  42. Nishtala, R., Fugal, H., Grimm, S., Kwiatkowski, M., Lee, H., Li, H.C., McElroy, R., Paleczny, M., Peek, D., Saab, P., Stafford, D., Tung, T., Venkataramani, V.: Scaling memcache at facebook. In: NSDI, pp. 385–398 (2013)

  43. Ousterhout, J., Agrawal, P., Erickson, D., Kozyrakis, C., Leverich, J., Mazières, D., Mitra, S., Narayanan, A., Parulkar, G., Rosenblum, M., Rumble, S.M., Stratmann, E., Stutsman, R.: The case for ramclouds: scalable high-performance storage entirely in dram. SIGOPS Oper. Syst. Rev. 43, 92–105 (2010)

    Article  Google Scholar 

  44. Pagh, R., Rodler, F.F.: Cuckoo hashing. J. Algorithms 51(2), 122–144 (2003)

    Article  MathSciNet  MATH  Google Scholar 

  45. Paul, J., He, J., He, B.: GPL: A GPU-based pipelined query processing engine. In: SIGMOD, pp. 1935–1950 (2016)

  46. Pirk, H., Manegold, S., Kersten, M.: Waste not... efficient co-processing of relational data. In: ICDE, pp. 508–519 (2014)

  47. Price, D.C., Clark, M.A., Barsdell, B.R., Babich, R., Greenhill, L.J.: Optimizing performance-per-watt on GPUs in high performance computing. Comput. Sci. Res. Dev. 31(4), 185–193 (2016)

    Article  Google Scholar 

  48. Richter, S., Alvarez, V., Dittrich, J.: A seven-dimensional analysis of hashing methods and its implications on query processing. In: VLDB, pp. 96–107 (2015)

  49. Ryoo, S., Rodrigues, C.I., Baghsorkhi, S.S., Stone, S.S., Kirk, D.B., Hwu, W.M.W.: Optimization principles and application performance evaluation of a multithreaded GPU using cuda. In: PPoPP, pp. 73–82 (2008)

  50. Tu, S., Zheng, W., Kohler, E., Liskov, B., Madden, S.: Speedy transactions in multicore in-memory databases. In: SOSP (2013)

  51. Wang, K., Ding, X., Lee, R., Kato, S., Zhang, X.: Gdm: Device memory management for GPGPU computing. In: SIGMETRICS, pp. 533–545 (2014)

  52. Wang, K., Zhang, K., Yuan, Y., Ma, S., Lee, R., Ding, X., Zhang, X.: Concurrent analytical query processing with GPUs. In: PVLDB, pp. 1011–1022 (2014)

  53. Wu, H., Diamos, G., Cadambi, S., Yalamanchili, S.: Kernel weaver: Automatically fusing database primitives for efficient GPU computation. In: MICRO, pp. 107–118 (2012)

  54. Yuan, Y., Lee, R., Zhang, X.: The yin and yang of processing data warehousing queries on GPU devices. In: PVLDB, pp. 817–828 (2013)

  55. Zhang, K., Chen, F., Ding, X., Huai, Y., Lee, R., Luo, T., Wang, K., Yuan, Y., Zhang, X.: Hetero-db: next generation high-performance database systems by best utilizing heterogeneous computing and storage resources. JCST 30, 657–678 (2015)

    Google Scholar 

  56. Zhang, K., Hu, J., He, B., Hua, B.: Dido: Dynamic pipelines for in-memory key-value stores on coupled CPU-GPU architectures. In: ICDE, pp. 671–682 (2017)

  57. Zhang, K., Hu, J., Hua, B.: A holistic approach to build real-time stream processing system with GPU. JPDC 83(C), 44–57 (2015)

    Google Scholar 

  58. Zhang, K., Wang, K., Yuan, Y., Guo, L., Lee, R., Zhang, X.: Mega-kv: a case for GPUs to maximize the throughput of in-memory key-value stores. Proc. VLDB Endow. 8, 1226–1237 (2015)

    Article  Google Scholar 

  59. Zhou, J., Ross, K.A.: Buffering accesses to memory-resident index structures. In: VLDB, pp. 405–416 (2003)

Download references

Acknowledgements

This work has been partially supported by the US National Science Foundation under grants CCF-1513944, CCF-1629403, and IIS-1718450, as well as a MoE AcRF Tier 1 grant (T1 251RES1610) and a MoE AcRF Tier 2 grant (MOE2017-T2-1-122) in Singapore.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Kai Zhang.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Zhang, K., Wang, K., Yuan, Y. et al. A distributed in-memory key-value store system on heterogeneous CPU–GPU cluster. The VLDB Journal 26, 729–750 (2017). https://doi.org/10.1007/s00778-017-0479-0

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00778-017-0479-0

Keywords

Navigation