DOI: 10.1145/3627703.3650072
research-article
Open Access

Trinity: A Fast Compressed Multi-attribute Data Store

Published: 22 April 2024

ABSTRACT

With the proliferation of attribute-rich machine-generated data, emerging real-time monitoring, diagnosis, and visualization tools ingest and analyze such data across multiple attributes simultaneously. Due to the sheer volume of the data, applications need storage-efficient and performant data representations to analyze them efficiently.

We present Trinity, a system that simultaneously facilitates query and storage efficiency across large volumes of multi-attribute records. Trinity accomplishes this through a new dynamic, succinct, multi-dimensional data structure, MdTrie. MdTrie combines a novel generalization of Morton codes, a multi-attribute query algorithm, and a self-indexed trie structure to achieve these goals. Our evaluation of Trinity on real-world use cases shows that, compared to state-of-the-art systems, it supports (1) 7.2-59.6× faster multi-attribute searches, (2) a storage footprint comparable to OLAP columnar stores and 4.8-15.1× lower than NoSQL stores and OLTP databases, and (3) point query throughput comparable to NoSQL stores and 1.7-52.5× higher than OLTP databases and OLAP columnar stores.
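For readers unfamiliar with Morton codes, the following is a minimal sketch of the classic (ungeneralized) Morton, or Z-order, encoding that MdTrie builds on. It is not the paper's generalization, which the abstract does not describe. The function names `morton_encode` and `morton_decode` and the assumption that every attribute has the same bit width are illustrative choices only.

```python
def morton_encode(coords, bits):
    """Interleave the low `bits` bits of each coordinate, most significant bit first.

    Assumes every coordinate is a non-negative integer that fits in `bits` bits.
    """
    code = 0
    for level in range(bits - 1, -1, -1):
        for c in coords:
            # Append this coordinate's bit at the current level.
            code = (code << 1) | ((c >> level) & 1)
    return code


def morton_decode(code, dims, bits):
    """Invert morton_encode: recover `dims` coordinates of `bits` bits each."""
    coords = [0] * dims
    for level in range(bits - 1, -1, -1):
        for d in range(dims):
            # Position of dimension d's bit within the group for this level.
            shift = level * dims + (dims - 1 - d)
            coords[d] = (coords[d] << 1) | ((code >> shift) & 1)
    return coords


if __name__ == "__main__":
    point = (5, 3)                      # x = 0b101, y = 0b011
    code = morton_encode(point, bits=3)
    print(bin(code))                    # 0b100111
    print(morton_decode(code, dims=2, bits=3))
```

Each group of `dims` consecutive bits in the code holds one bit from every attribute, so a trie keyed on these groups splits on all attributes at each level. This is the usual intuition for why interleaved codes suit multi-attribute searches.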


  • Published in

    EuroSys '24: Proceedings of the Nineteenth European Conference on Computer Systems
    April 2024
    1245 pages
    ISBN:9798400704376
    DOI:10.1145/3627703

    Copyright © 2024 Owner/Author

    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 241 of 1,308 submissions, 18%