Leveraging a scalable row store to build a distributed text index

Authors:
Ning Li, Facebook, Palo Alto, CA, USA
Jun Rao, IBM Almaden Research Center, San Jose, CA, USA
Eugene Shekita, IBM Almaden Research Center, San Jose, CA, USA
Sandeep Tata, IBM Almaden Research Center, San Jose, CA, USA

CloudDB '09: Proceedings of the first international workshop on Cloud data management, November 2009, Pages 29–36
https://doi.org/10.1145/1651263.1651270

Published: 02 November 2009

ABSTRACT

Many content-oriented applications require a scalable text index, but building one is challenging. Beyond the logic of inserting and searching documents, developers must handle the usual problems of a distributed environment, such as fault tolerance, incremental growth of the index cluster, and load balancing. We developed a distributed text index called HIndex by judiciously exploiting the control layer of HBase, an open source implementation of Google's Bigtable. This lets HIndex inherit HBase's support for availability, elasticity, and load balancing. In this paper we present the design, implementation, and performance evaluation of HIndex.

Index Terms

1. Information systems
  1. Information retrieval

Published in
CloudDB '09: Proceedings of the first international workshop on Cloud data management
November 2009
62 pages
ISBN: 9781605588025
DOI: 10.1145/1651263
General Chair: Xiaofeng Meng, Renmin University of China, China
Program Chairs: Haixun Wang, IBM T. J. Watson Research, USA; Ying Chen, IBM China Research Lab, China
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
incremental bigtable hbase
Qualifiers
- research-article
Acceptance Rates
CloudDB '09 Paper Acceptance Rate: 8 of 11 submissions, 73%
Overall Acceptance Rate: 12 of 17 submissions, 71%