ABSTRACT
Many content-oriented applications require a scalable text index, and building one is challenging. Beyond the logic of inserting and searching documents, developers must handle the usual concerns of a distributed environment, such as fault tolerance, incremental growth of the index cluster, and load balancing. We developed a distributed text index called HIndex by judiciously exploiting the control layer of HBase, an open source implementation of Google's Bigtable. This reuse lets HIndex inherit HBase's support for availability, elasticity, and load balancing. In this paper we present the design, implementation, and performance evaluation of HIndex.
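To make the idea of layering a text index on a row store concrete, here is a minimal sketch. It is not HIndex's actual design, which the abstract does not describe. It assumes one row per term and one column per document, and the names `ToyRowStore`, `insert_document`, and `search` are hypothetical stand-ins rather than HBase or HIndex APIs.

```python
from collections import defaultdict


class ToyRowStore:
    """In-memory stand-in for a row store (hypothetical, not the HBase API)."""

    def __init__(self):
        # row key -> {column qualifier -> value}
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self.rows[row_key][column] = value

    def get(self, row_key):
        # Return a copy so callers cannot mutate stored rows.
        return dict(self.rows.get(row_key, {}))


def tokenize(text):
    # Lowercase and split on whitespace; real analyzers do much more.
    return text.lower().split()


def insert_document(store, doc_id, text):
    # Assumed layout: one row per term, one column per document,
    # with the term frequency stored as the cell value.
    counts = defaultdict(int)
    for term in tokenize(text):
        counts[term] += 1
    for term, freq in counts.items():
        store.put(term, doc_id, freq)


def search(store, query):
    # Conjunctive (AND) query: intersect the posting rows of all query terms.
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(store.get(terms[0]))
    for term in terms[1:]:
        result &= set(store.get(term))
    return result
```

In a layout like this, each term's postings live in a single row, so whatever partitioning, replication, and rebalancing the underlying store applies to rows would apply to the index as well.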
Index Terms
- Leveraging a scalable row store to build a distributed text index