DOI: 10.1145/1651263.1651270
research-article

Leveraging a scalable row store to build a distributed text index

Published: 02 November 2009

ABSTRACT

Many content-oriented applications require a scalable text index, but building one is challenging. Beyond the logic of inserting and searching documents, developers must handle the usual concerns of a distributed environment, such as fault tolerance, incremental growth of the index cluster, and load balancing. We developed a distributed text index called HIndex by judiciously exploiting the control layer of HBase, an open-source implementation of Google's Bigtable. This approach lets HIndex inherit HBase's support for availability, elasticity, and load balancing. This paper presents the design, implementation, and performance evaluation of HIndex.

    Published in

      CloudDB '09: Proceedings of the first international workshop on Cloud data management
      November 2009
      62 pages
      ISBN: 9781605588025
      DOI: 10.1145/1651263

      Copyright © 2009 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      CloudDB '09 Paper Acceptance Rate: 8 of 11 submissions, 73%
      Overall Acceptance Rate: 12 of 17 submissions, 71%
