LuBase: A Search-Efficient Hybrid Storage System for Massive Text Data

Jia, Debin; Liu, Zhengwei; Gu, Xiaoyan; Li, Bo; Gu, Jingzi; Wang, Weiping; Meng, Dan

doi:10.1007/978-3-319-27122-4_10

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9529)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

Recent years have seen strong interest in big data analytics systems, and some of them, being highly scalable and fault tolerant, are widely used in production. However, such systems are mostly designed to process immutable data stored in HDFS and are not well suited to real-time text data held in NoSQL databases such as HBase. In this paper, we propose LuBase, a search-efficient hybrid storage system for large-scale text data analytics. Beyond providing a novel hybrid storage system with a fine-grained index, LuBase also introduces a new query processing flow that fully exploits a pre-built full-text index to accelerate interactive queries while achieving more efficient I/O. We implemented LuBase in a data analytics system based on Impala. Experimental results show that LuBase benefits substantially from Lucene indexing and significantly improves Impala's performance when querying HBase.
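
The general idea of answering a text query from a pre-built index instead of scanning every row can be illustrated with a toy sketch. This is not LuBase's implementation and uses neither Lucene nor HBase: the in-memory dictionary `rows`, the whitespace tokenizer, and the function names `build_index`, `search_full_scan`, and `search_with_index` are all assumptions made for illustration only.

```python
from collections import defaultdict


def build_index(rows):
    """Build a toy inverted index: lowercase token -> set of row keys."""
    index = defaultdict(set)
    for key, text in rows.items():
        for token in text.lower().split():
            index[token].add(key)
    return index


def search_full_scan(rows, term):
    """Baseline: read every row and test whether it contains the term."""
    term = term.lower()
    return sorted(k for k, text in rows.items() if term in text.lower().split())


def search_with_index(rows, index, term):
    """Index-assisted: look up matching keys first, then fetch only those rows."""
    keys = sorted(index.get(term.lower(), ()))
    return [(k, rows[k]) for k in keys]


# Hypothetical sample data standing in for a table keyed by row key.
rows = {
    "r1": "HBase stores real-time text data",
    "r2": "Impala queries data in HDFS",
    "r3": "Lucene builds a full-text index",
}
index = build_index(rows)
matches = search_with_index(rows, index, "data")
```

Both search functions return the same row keys, but the index-assisted version touches only the matching rows, which is the kind of I/O saving the abstract attributes to using a pre-built full-text index.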

Acknowledgments

This work is partially supported by the National HeGaoJi Key Project under grant 2013ZX01039-002-001-001, the National KeJiZhiCheng Project under grant 2012BAH46B03, and the “Strategic Priority Research Program” of the Chinese Academy of Sciences under grant XDA06030200.

Author information

Authors and Affiliations

Institute of Information Engineering, Chinese Academy of Sciences, Beijing, 100093, China
Debin Jia, Xiaoyan Gu, Bo Li, Jingzi Gu, Weiping Wang & Dan Meng
Inspur Group Co., Ltd., Beijing, 100085, China
Zhengwei Liu
National Engineering Laboratory for Information Security Technologies, Chinese Academy of Sciences, Beijing, 100093, China
Debin Jia
University of Chinese Academy of Sciences, Beijing, 100049, China
Debin Jia
State Key Laboratory of High-end Server Storage Technology, Beijing, 100085, China
Zhengwei Liu

Corresponding author

Correspondence to Xiaoyan Gu.

Editor information

Editors and Affiliations

Central South University, Changsha, China
Guojun Wang
The University of Sydney, Sydney, New South Wales, Australia
Albert Zomaya
University of Murcia, Murcia, Murcia, Spain
Gregorio Martinez
Hunan University, Changsha, China
Kenli Li

About this paper

Cite this paper

Jia, D. et al. (2015). LuBase: A Search-Efficient Hybrid Storage System for Massive Text Data. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9529. Springer, Cham. https://doi.org/10.1007/978-3-319-27122-4_10

DOI: https://doi.org/10.1007/978-3-319-27122-4_10
Published: 16 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27121-7
Online ISBN: 978-3-319-27122-4
eBook Packages: Computer Science, Computer Science (R0)
