DOI: 10.1145/3183713.3196900
research-article
Public Access

A Comparative Study of Secondary Indexing Techniques in LSM-based NoSQL Databases

Published: 27 May 2018

ABSTRACT

NoSQL databases are increasingly used in big data applications because they achieve fast write throughput and fast lookups on the primary key. Many of these applications also require queries on non-primary attributes, so several NoSQL databases have added support for secondary indexes. However, these efforts are fragmented: each system generally supports only one type of secondary index, and may refer to it by a different name or by no name at all. Because no single system supports all types of secondary indexes, no experimental head-to-head comparison or performance analysis of the various secondary indexing techniques in terms of throughput and space exists. In this paper, we present a taxonomy of NoSQL secondary indexes, broadly split into two classes: Embedded Indexes (i.e., lightweight filters embedded inside the primary table) and Stand-Alone Indexes (i.e., separate data structures). To ensure the fairness of our comparative study, we built a system, LevelDB++, on top of Google's popular open-source LevelDB key-value store. There, we implemented two Embedded Indexes and three state-of-the-art Stand-Alone Indexes, which cover most of the popular NoSQL databases. Our comprehensive experimental study and theoretical evaluation show that none of these indexing techniques dominates the others: the Embedded Indexes offer superior write throughput and are more space efficient, whereas the Stand-Alone Indexes achieve faster query response times. Thus, the optimal choice of secondary index depends on the application workload. This paper provides an empirical guideline for choosing secondary indexes.
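The two index classes named in the abstract can be illustrated with a toy in-memory sketch. This is not LevelDB++ code and does not reproduce any of the five indexes the paper implements. The class names, `BLOCK_SIZE`, and the sample records are hypothetical; a plain Python set stands in for a lightweight per-block filter; and keys are assumed to be inserted once and never updated or deleted.

```python
from collections import defaultdict

BLOCK_SIZE = 2  # hypothetical: number of records per block of the primary table


class EmbeddedIndexTable:
    """Toy primary table whose blocks each carry a filter on the secondary attribute."""

    def __init__(self):
        # Each block holds its records plus a set of the secondary values it contains.
        self.blocks = []

    def put(self, pk, value):
        # Single write: the record and its filter entry go into the same block.
        if not self.blocks or len(self.blocks[-1]["records"]) >= BLOCK_SIZE:
            self.blocks.append({"records": {}, "filter": set()})
        block = self.blocks[-1]
        block["records"][pk] = value
        block["filter"].add(value)

    def lookup(self, value):
        # Check every block's filter; scan records only in blocks that may match.
        hits = []
        for block in self.blocks:
            if value in block["filter"]:
                hits.extend(pk for pk, v in block["records"].items() if v == value)
        return sorted(hits)


class StandAloneIndexTable:
    """Toy primary table with a separate structure mapping secondary value to keys."""

    def __init__(self):
        self.primary = {}
        self.index = defaultdict(list)

    def put(self, pk, value):
        # Two writes: one to the primary table, one to the separate index.
        self.primary[pk] = value
        self.index[value].append(pk)

    def lookup(self, value):
        # Direct lookup in the index; the primary table is not scanned.
        return sorted(self.index.get(value, []))


# Hypothetical records: (primary key, secondary attribute value).
records = [("t1", "alice"), ("t2", "bob"), ("t3", "alice"), ("t4", "carol")]
embedded, stand_alone = EmbeddedIndexTable(), StandAloneIndexTable()
for pk, user in records:
    embedded.put(pk, user)
    stand_alone.put(pk, user)

print(embedded.lookup("alice"), stand_alone.lookup("alice"))
```

Both tables return the same keys for a given secondary value. The sketch only mirrors the shape of the trade-off the abstract reports: the embedded variant performs one write per record and keeps no extra structure, while the stand-alone variant pays a second write and extra space in exchange for a lookup that skips the primary table.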


      • Published in

        SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
        May 2018
        1874 pages
        ISBN: 9781450347037
        DOI: 10.1145/3183713

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        SIGMOD '18 Paper Acceptance Rate: 90 of 461 submissions, 20%
        Overall Acceptance Rate: 785 of 4,003 submissions, 20%
