A Comparative Study of Secondary Indexing Techniques in LSM-based NoSQL Databases

Authors:
Mohiuddin Abdul Qader, University of California Riverside, Riverside, CA, USA
Shiwen Cheng, University of California Riverside, Riverside, CA, USA
Vagelis Hristidis, University of California Riverside, Riverside, CA, USA

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data, May 2018, Pages 551–566. https://doi.org/10.1145/3183713.3196900

Published: 27 May 2018

ABSTRACT

NoSQL databases are increasingly used in big data applications because they achieve fast write throughput and fast lookups on the primary key. Many of these applications also require queries on non-primary attributes, and several NoSQL databases have therefore added support for secondary indexes. However, these efforts are fragmented: each system generally supports one type of secondary index and may use a different name, or no name at all, for it. Because no single system supports all types of secondary indexes, there has been no experimental head-to-head comparison or performance analysis of the various secondary indexing techniques in terms of throughput and space. In this paper, we present a taxonomy of NoSQL secondary indexes, broadly split into two classes: Embedded Indexes (i.e., lightweight filters embedded inside the primary table) and Stand-Alone Indexes (i.e., separate data structures). To ensure the fairness of our comparative study, we built a system, LevelDB++, on top of Google's popular open-source LevelDB key-value store. There, we implemented two Embedded Indexes and three state-of-the-art Stand-Alone Indexes, which cover most of the popular NoSQL databases. Our comprehensive experimental study and theoretical evaluation show that none of these indexing techniques dominates the others: the Embedded Indexes offer superior write throughput and are more space efficient, whereas the Stand-Alone Indexes achieve faster query response times. Thus, the optimal choice of secondary index depends on the application workload. This paper provides an empirical guideline for choosing secondary indexes.
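To make the two classes in the taxonomy concrete, the following is a minimal in-memory sketch, not the LevelDB++ implementation. The class `ToyStore`, its methods, and the sample records are hypothetical names introduced for illustration. A plain Python set per block stands in for the lightweight filter of an Embedded Index, and a dictionary from attribute value to primary keys stands in for a Stand-Alone Index. The sketch assumes each primary key is written once (no updates or deletes).

```python
from collections import defaultdict


class ToyStore:
    """Hypothetical toy table holding records in fixed-size blocks."""

    def __init__(self, attr, block_size=2):
        self.attr = attr                      # the secondary attribute being indexed
        self.block_size = block_size
        self.blocks = []                      # primary table: list of blocks of (key, record)
        self.filters = []                     # embedded index: one value set per block
        self.standalone = defaultdict(list)   # stand-alone index: value -> primary keys

    def put(self, key, record):
        # Start a new block (and its filter) when the last block is full.
        if not self.blocks or len(self.blocks[-1]) >= self.block_size:
            self.blocks.append([])
            self.filters.append(set())
        self.blocks[-1].append((key, record))
        value = record[self.attr]
        self.filters[-1].add(value)           # embedded: update the block's filter
        self.standalone[value].append(key)    # stand-alone: extra write to a separate structure

    def query_embedded(self, value):
        """Scan only the blocks whose filter may contain the value."""
        hits = []
        for block, block_filter in zip(self.blocks, self.filters):
            if value in block_filter:
                hits.extend(k for k, r in block if r[self.attr] == value)
        return hits

    def query_standalone(self, value):
        """Look the value up directly in the separate index structure."""
        return list(self.standalone.get(value, []))


store = ToyStore(attr="user", block_size=2)
store.put("t1", {"user": "alice", "text": "hi"})
store.put("t2", {"user": "bob", "text": "yo"})
store.put("t3", {"user": "alice", "text": "again"})
store.put("t4", {"user": "carol", "text": "hey"})

print(store.query_embedded("alice"))
print(store.query_standalone("alice"))
```

Both queries return the same primary keys. The embedded variant keeps only a small filter next to the data and must still scan matching blocks, while the stand-alone variant pays an extra index write on every `put` in exchange for a direct lookup, which mirrors the trade-off the abstract reports.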

Index Terms

A Comparative Study of Secondary Indexing Techniques in LSM-based NoSQL Databases
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs
        Key-value stores
  2. Information storage systems
    1. Record storage systems
      1. Record storage alternatives
        Indexed file organization

Published in
SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
May 2018
1874 pages
ISBN: 9781450347037
DOI: 10.1145/3183713
General Chairs:
Gautam Das, University of Texas at Arlington, USA
Christopher Jermaine, Rice University, USA
Philip Bernstein, Microsoft Research, USA
Copyright © 2018 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
leveldb
lsm-tree
nosql
secondary indexing
top-$k$
Qualifiers
- research-article
Acceptance Rates
SIGMOD '18 paper acceptance rate: 90 of 461 submissions, 20%
Overall acceptance rate: 785 of 4,003 submissions, 20%