ABSTRACT
With the proliferation of attribute-rich machine-generated data, emerging real-time monitoring, diagnosis, and visualization tools ingest and analyze such data across multiple attributes simultaneously. Given the sheer volume of this data, applications need data representations that are both storage-efficient and fast to query.
We present TRINITY, a system that provides both query efficiency and storage efficiency across large volumes of multi-attribute records. TRINITY accomplishes this through a new dynamic, succinct, multi-dimensional data structure, MdTrie. MdTrie combines a novel generalization of Morton codes, a multi-attribute query algorithm, and a self-indexed trie structure to achieve these goals. Our evaluation of TRINITY on real-world use cases shows that, compared to state-of-the-art systems, it achieves (1) 7.2-59.6× faster multi-attribute searches, (2) a storage footprint comparable to OLAP columnar stores and 4.8-15.1× lower than NoSQL stores and OLTP databases, and (3) point-query throughput comparable to NoSQL stores and 1.7-52.5× higher than OLTP databases and OLAP columnar stores.
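The abstract names a generalization of Morton codes as one ingredient of MdTrie. As background only, the classic Morton (Z-order) code interleaves the bits of every attribute into a single integer key. The sketch below shows that standard encoding, not TRINITY's generalization; the function names `morton_encode` and `morton_decode` are hypothetical, and it assumes every attribute is a non-negative integer that fits in the same number of bits.

```python
def morton_encode(coords, bits):
    """Interleave the low `bits` bits of each coordinate into one integer.

    Bit `b` of dimension `d` lands at position `b * dims + d` of the result,
    so all dimensions contribute one bit to each group of `dims` bits.
    """
    dims = len(coords)
    code = 0
    for bit in range(bits):
        for d, value in enumerate(coords):
            code |= ((value >> bit) & 1) << (bit * dims + d)
    return code


def morton_decode(code, dims, bits):
    """Invert morton_encode: recover the per-dimension values from a code."""
    coords = [0] * dims
    for bit in range(bits):
        for d in range(dims):
            coords[d] |= ((code >> (bit * dims + d)) & 1) << bit
    return tuple(coords)


if __name__ == "__main__":
    key = morton_encode((3, 5), bits=3)
    print(key, morton_decode(key, dims=2, bits=3))
```

Because each group of `dims` consecutive bits holds one bit from every attribute, a trie built over such keys can consume one group per level, which narrows all attributes at once during a multi-attribute search.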