Abstract
Recent years have witnessed great enthusiasm for big data analytics systems, and some of them, offering high scalability and fault tolerance, are used extensively in production. However, such systems are mostly designed to process immutable data stored in HDFS and are not well suited to real-time text data in NoSQL databases such as HBase. In this paper, we propose LuBase, a search-efficient hybrid storage system for large-scale text data analytics. Beyond providing a hybrid storage system with a fine-grained index, LuBase introduces a new query processing flow that fully exploits the pre-built full-text index to accelerate interactive queries while also improving I/O efficiency. We implemented LuBase in a data analytics system based on Impala. Experimental results demonstrate that LuBase benefits substantially from Lucene indexing and significantly improves Impala's performance when querying HBase.
Acknowledgments
This work is partially supported by the National HeGaoJi Key Project under grant 2013ZX01039-002-001-001, the National KeJiZhiCheng Project under grant 2012BAH46B03, and the “Strategic Priority Research Program” of the Chinese Academy of Sciences under grant XDA06030200.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Jia, D. et al. (2015). LuBase: A Search-Efficient Hybrid Storage System for Massive Text Data. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9529. Springer, Cham. https://doi.org/10.1007/978-3-319-27122-4_10
DOI: https://doi.org/10.1007/978-3-319-27122-4_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27121-7
Online ISBN: 978-3-319-27122-4
eBook Packages: Computer Science, Computer Science (R0)