ABSTRACT
NoSQL databases are increasingly used in big data applications because they achieve fast write throughput and fast lookups on the primary key. Many of these applications also require queries on non-primary attributes. For that reason, several NoSQL databases have added support for secondary indexes. However, these efforts are fragmented: each system generally supports only one type of secondary index, and may use a different name, or no name at all, to refer to it. Because no single system supports all types of secondary indexes, no experimental head-to-head comparison or performance analysis of the various secondary indexing techniques in terms of throughput and space exists. In this paper, we present a taxonomy of NoSQL secondary indexes, broadly split into two classes: Embedded Indexes (i.e., lightweight filters embedded inside the primary table) and Stand-Alone Indexes (i.e., separate data structures). To ensure the fairness of our comparative study, we built a system, LevelDB++, on top of Google's popular open-source LevelDB key-value store. There, we implemented two Embedded Indexes and three state-of-the-art Stand-Alone Indexes, which cover most of the popular NoSQL databases. Our comprehensive experimental study and theoretical evaluation show that none of these indexing techniques dominates the others: the Embedded Indexes offer superior write throughput and are more space efficient, whereas the Stand-Alone Indexes achieve faster query response times. Thus, the optimal choice of secondary index depends on the application workload. This paper provides an empirical guideline for choosing secondary indexes.
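The distinction between the two classes can be illustrated with a toy sketch. This is not LevelDB++ code: the class names, the `BLOCK_SIZE` constant, the insert-only tables, and the use of a plain Python set in place of a real lightweight filter (such as a Bloom filter) are all assumptions made purely for illustration.

```python
from collections import defaultdict

BLOCK_SIZE = 2  # records per block; hypothetical and tiny, for illustration only


class EmbeddedIndexTable:
    """Toy primary table whose blocks each carry a filter over a secondary attribute."""

    def __init__(self):
        # Each block holds its records plus a filter stored alongside them.
        self.blocks = []

    def put(self, pk, user):
        # Start a new block when the current one is full (insert-only sketch).
        if not self.blocks or len(self.blocks[-1]["records"]) >= BLOCK_SIZE:
            self.blocks.append({"records": {}, "filter": set()})
        block = self.blocks[-1]
        block["records"][pk] = user
        # A set stands in for a lightweight filter such as a Bloom filter.
        block["filter"].add(user)

    def lookup(self, user):
        """Return matching primary keys and the number of blocks actually scanned."""
        hits, scanned = [], 0
        for block in self.blocks:
            if user not in block["filter"]:
                continue  # the filter lets us skip this block entirely
            scanned += 1
            hits.extend(pk for pk, u in block["records"].items() if u == user)
        return hits, scanned


class StandAloneIndexTable:
    """Toy primary table plus a separate structure: secondary value -> primary keys."""

    def __init__(self):
        self.primary = {}
        self.index = defaultdict(list)

    def put(self, pk, user):
        self.primary[pk] = user
        self.index[user].append(pk)  # extra write to the separate index on every insert

    def lookup(self, user):
        return list(self.index.get(user, []))


rows = [("t1", "alice"), ("t2", "bob"), ("t3", "carol"),
        ("t4", "bob"), ("t5", "alice"), ("t6", "dave")]
embedded, standalone = EmbeddedIndexTable(), StandAloneIndexTable()
for pk, user in rows:
    embedded.put(pk, user)
    standalone.put(pk, user)

print(embedded.lookup("alice"))    # keys found, plus how many blocks were scanned
print(standalone.lookup("alice"))  # keys read directly from the separate index
```

In this sketch the embedded variant writes nothing beyond the primary table but must still visit every block's filter at query time, while the stand-alone variant pays an extra write per insert and answers a query with a single index lookup. That mirrors, in miniature, the trade-off the abstract reports.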
Index Terms
- A Comparative Study of Secondary Indexing Techniques in LSM-based NoSQL Databases