LuBase: A Search-Efficient Hybrid Storage System for Massive Text Data

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9529)

Abstract

Recent years have witnessed a great deal of enthusiasm for big data analytics systems. Some of them, offering high scalability and fault tolerance, are extensively used in production. However, such systems are mostly designed for processing immutable data stored in HDFS and are not well suited to real-time text data held in a NoSQL database such as HBase. In this paper, we propose LuBase, a search-efficient hybrid storage system for large-scale text data analytics. Beyond being a novel hybrid storage system with a fine-grained index, LuBase also introduces a new query processing flow that makes full use of a pre-built full-text index to accelerate interactive queries while achieving more efficient I/O. We implemented LuBase in a data analytics system based on Impala. Experimental results demonstrate that LuBase benefits substantially from the Lucene indexing technique and brings a significant performance improvement to Impala when querying HBase.
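The general idea of answering a text query from a pre-built full-text index instead of scanning every row can be illustrated with a small sketch. This is not LuBase's implementation, and it uses neither Lucene, HBase, nor Impala: the class `ToyIndexedStore`, its methods, and the `rows_read` counter are hypothetical names introduced here for illustration only. A plain dictionary stands in for the table and an in-memory inverted index stands in for the full-text index.

```python
from collections import defaultdict


class ToyIndexedStore:
    """Hypothetical toy model: a key-value table paired with an inverted index."""

    def __init__(self):
        self.rows = {}                    # row key -> text (stand-in for the table)
        self.postings = defaultdict(set)  # term -> row keys (stand-in for the index)
        self.rows_read = 0                # number of simulated row reads

    def put(self, row_key, text):
        # Remove postings of any previous value so the index matches the table.
        old = self.rows.get(row_key)
        if old is not None:
            for term in set(old.lower().split()):
                self.postings[term].discard(row_key)
        self.rows[row_key] = text
        for term in set(text.lower().split()):
            self.postings[term].add(row_key)

    def scan_search(self, term):
        """Baseline: read every row and filter on the term."""
        term = term.lower()
        hits = []
        for key in sorted(self.rows):
            self.rows_read += 1
            if term in self.rows[key].lower().split():
                hits.append(key)
        return hits

    def index_search(self, term):
        """Index-assisted: resolve row keys first, then read only those rows."""
        keys = sorted(self.postings.get(term.lower(), ()))
        for key in keys:
            self.rows_read += 1
            _ = self.rows[key]  # simulated point lookup
        return keys


store = ToyIndexedStore()
store.put("r1", "hybrid storage for text data")
store.put("r2", "interactive queries over text")
store.put("r3", "fault tolerant analytics")
```

In this toy model both search paths return the same row keys, but the index-assisted path reads only the matching rows, which is the kind of I/O saving the abstract attributes to the pre-built index. How LuBase actually builds, stores, and consults its index is described in the full paper, not here.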

Acknowledgments

This work is partially supported by the National HeGaoJi Key Project under grant number 2013ZX01039-002-001-001, the National KeJiZhiCheng Project under grant number 2012BAH46B03, and the “Strategic Priority Research Program” of the Chinese Academy of Sciences under grant number XDA06030200.

Author information

Corresponding author

Correspondence to Xiaoyan Gu.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Jia, D. et al. (2015). LuBase: A Search-Efficient Hybrid Storage System for Massive Text Data. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9529. Springer, Cham. https://doi.org/10.1007/978-3-319-27122-4_10

  • DOI: https://doi.org/10.1007/978-3-319-27122-4_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27121-7

  • Online ISBN: 978-3-319-27122-4

  • eBook Packages: Computer Science, Computer Science (R0)
